Preface



French novelist Marcel Proust instructs us that, “a voyage of discovery 
consists, not of seeking new landscapes, but of seeing through new eyes.” 
Nowhere in the practice of education do we need to see through new eyes
more than in the domain of assessment. We have been trapped by our collective
experiences to see a limited array of things to be assessed, a very few ways 
of assessing them, limited strategies for communicating results and 
inflexible roles of players in the assessment drama. 

This edited book of readings jolts us out of traditional habits of mind 
about assessment. An international team of innovative thinkers relies on the 
best current research on learning and cognition to describe how to use
assessment to promote, not merely check for, student learning. In effect, they 
explore a new vision of assessment for the new millennium. The authors 
address the rapidly expanding array of achievement targets students must hit, 
the increasingly productive variety of assessment methods available to 
educators, innovative ways of collecting and communicating evidence of 
learning, and a fundamental redefinition of both students’ and teachers’ roles 
in the assessment process. 

With respect to the latter, special attention is given throughout to what I 
believe to be the future of assessment in education: assessment FOR 
learning. The focus is on student involvement in the assessment,
record-keeping and communication process. The authors address not only
critically important matters of assessment quality but also issues related to
the impact of assessment procedures and scores on learners and their
well-being.

Those interested in the exploration of assessments that are placed in 
authentic performance contexts, multidimensional in their focus, integrated 
into the learning process and open to the benefits of student involvement will
learn much here. 



Rick Stiggins 

Assessment Training Institute 
Portland, Oregon USA 
January 10, 2003 

1. INTRODUCTION

Assessment of student achievement is changing, largely because today's
students face a world that will demand new knowledge and abilities. They
will need to become life-long learners in a world that will demand
competencies and skills not yet defined. In this 21st century, students need
to understand
the basic knowledge of their domain of study, but also need to be able to 
think critically, to analyse, to synthesize and to make inferences. Helping 
students to develop these skills will require changes in the assessment 
culture and the assessment practice at the school and classroom level, as well 
as in higher education, and in the work environment. It will also require new 
approaches to large-scale, high-stakes assessments. 

A growing body of literature describes these changes in assessment 
practice and the development of new modes of assessment. However, only a
few of these publications address the critical issue of quality. The paradigm change from
a testing culture to an assessment culture can only be continued when 
research offers sufficient empirical evidence for the various aspects of 
quality of this new assessment culture (Birenbaum & Dochy, 1996). This 
book intends to contribute to this aim by presenting a series of studies on 
various aspects of quality of new modes of assessment. It starts by 
elaborating on the conceptual framework of this paradigm change and 
related new approaches in edumetrics concerning the development of an
expanded set of quality criteria. A series of new modes of assessment, with
special emphasis on the quality considerations involved, is then described
from a research perspective. Finally, recent developments in the field of
e-assessment and the impact of new technologies are described. The
current chapter introduces the different issues addressed in this book.



2. CHANGING PERSPECTIVES ON THE NATURE 
OF ASSESSMENT WITHIN THE LARGER 
FRAMEWORK OF LEARNING THEORIES 

During the last decades, the concept of learning has been reformulated 
based on new insights, developed within various related disciplines such as 
cognitive psychology, learning sciences and instructional psychology. 
Effective or meaningful learning is conceived as occurring when a learner 
constructs his or her own knowledge base that can be used as a tool to 
interpret the world and to solve complex problems. This implies that learners 
must be self-dependent and self-regulating, and that they need to be 
motivated to continually use and broaden their knowledge base. Learners 
need to develop strategic learning behaviour, meaning they must master 
effective strategies for their own learning. Finally, learners need meta- 
cognitive skills in order to be able to reflect on their own and others’ 
perspectives. 

These changes in the current views on learning lead to the rethinking of 
the nature of assessment. Indeed, there is currently broad agreement within
the field of educational psychology, as well as across its boundaries, that
learning should be in congruence with assessment (Birenbaum & Dochy,
1996). This has led to the rise of the so-called assessment culture. The
major changes in assessment, as defined by Kulieke, Bakker, Collins,
Fennimore, Fine, Herman, Jones, Raack, & Tinzmann (1990), are the moves
from testing to multiple assessments, and from isolated to integrated
assessment.

We can portray the aspects of assessment in seven continua. This schema
(Figure 1) is mainly based on Kulieke et al. (1990, p. 5).

Figure 1. The Characteristics of Assessment on Six Continua

The first continuum shows a change from decontextualized, atomic tests 
to authentic contextualized tests. In practice, it refers to the shift from the
so-called objective tests with item formats such as short answer,
fill-in-the-blank, multiple-choice and true/false to the use of portfolio
assessment, project-based assessment, performance assessment, etc. The
second continuum
shows a tendency from describing a student’s competence with one single 
measure (a mark) towards portraying a student’s competence by a student’s 
profile based on multiple measures. The third continuum depicts the 
movement from low levels of competence towards high levels of 
competence. This is the move from mainly assessing reproduction of 
knowledge to assessing higher-order skills. The fourth continuum refers to 
the multidimensionality of intelligence. Intelligence is more than cognition; 
it certainly involves meta-cognition, but also affective and social dimensions
and sometimes psychomotor skills. The fifth continuum concerns the move 
towards integrating assessment into the learning process. To a growing 
extent, the strength of assessment as a tool for dynamic ongoing learning is 
stressed. The sixth continuum refers to the change in responsibilities, not 
only in the learning process but also in the assessment process. The 
increasing implementation of self- and peer assessment is an example of this
move from teacher to student responsibility. 

Finally, the seventh continuum refers to the shift from the assessment of
learning towards a balance between assessment of learning and assessment
for learning. Research has shown convincingly that using assessment as a
tool for learning, including good and well-timed feedback, leads to better
results when assessing learning outcomes.

The chapter by Birenbaum elaborates on this paradigm change in learning
and assessment. She describes the current perspectives on instruction, 
learning and assessment (ILA) and illustrates this ILA culture with an 
example of a learning environment. 



3. EDUMETRICS AND NEW MODES OF 
ASSESSMENT 

With the increasing implementation of new modes of assessment at all 
levels of education, from state and district assessments to classroom 
assessments, questions are raised about the quality of these new modes of 
assessment (Birenbaum & Dochy, 1996). Edumetric indicators like 
reliability and validity are used traditionally to evaluate the quality of 
educational assessment. The validity question refers to the extent to which 
assessment measures what it purports to measure. Does the content of 
assessment correlate with the goals of education? Reliability was 
traditionally defined as the extent to which a test measures consistently. 
Consistency in test results demonstrated objectivity in scoring: the same 
results were obtained if the test was judged by another person or by the same 
person at another time. The meaning of the concept reliability was 
determined by the then prevailing opinion that assessment needs to fulfil 
above all a selecting function. Fairness in testing was aligned with 
objectivity. Striving to achieve objectivity in testing and comparing scores 
resulted in the use of standardized testing forms, like multiple-choice tests. 
Some well-known scientists state these days that since we uncritically 
searched for the most reliable tests, learning processes of children and 
students in schools are not what we hoped for, or what they need to be. Tests 
have an enormous power in steering learning processes. This might work to 
such an extent that even very reliable tests do elicit unproductive or 
unwanted learning. We can think here of students trying to memorize old test
items and their answers, students anticipating expected test formats, students
being drilled in and practising guessing, etc. At the EARLI Assessment
conference in the UK (Newcastle, 2002) a colleague said: “Psychometrics is 
not God; valid and reliable tests do not warrant good learning processes, they 
do not guarantee that we get where we want to. Even worse, the most perfect 
tests could lead to the worst (perhaps superficial) learning”. 

Various researchers, such as Messick, Linn, Baker and Dunbar, have
argued that the traditional psychometric criteria are inappropriate for
evaluating the quality of new modes of assessment. They pointed out that
there was an urgent need for the development of a different or an expanded
set of quality criteria for new modes of assessment. Although there are
differences in perspectives between the researchers, four aspects are 
considered as part of a comprehensive strategy for conducting a quality 
evaluation: the validity of the assessment tasks, the validity of assessment 
scoring, the generalizability of the assessment, and the consequential validity 
of the assessment process. Because of research evidence regarding the 
steering effect of assessment, and indicating the power of assessment as a 
tool for learning, the consequential aspect of validity has gained increasing 
interest. The central question is to what extent an assessment leads to the
intended consequences or produces unintended consequences such as
test anxiety and teaching to the test. The chapter by Gielen, Dochy & Dierick
elaborates on these quality issues. They illustrate the consequential validity
of new modes of assessment from a research perspective. The effects of
various aspects of new modes of assessment on student learning are
described: the effect of the cognitive complexity of assessment tasks, of 
feedback (formative function of assessment or assessment for learning), of 
transparent assessment criteria and the involvement of students in the 
assessment process, and the effect of criterion-referenced standards setting. 



4. THE QUALITIES OF NEW MODES OF
ASSESSMENT

Since there is a significant consensus about the main features of effective 
learning, and the influence of assessment on student learning, on instruction 
and on curriculum is widely acknowledged, educators, policy makers and 
others are turning to new modes of assessment as part of a broader 
educational reform. The movement away from traditional, multiple-choice 
tests to new modes of assessment has included a wide variety of instruments. 

In alignment with the principle that a student should be responsible for 
his own learning and assessment process, self-assessment strategies are 
implemented. Starting from the perspective that learning is a social process 
and self-reflection is enriched by critical reflection by peers, peer-assessment 
is now widely used. Stressing the importance of meta-cognitive skills, 
student responsibility and the complex nature of competencies, portfolios are 
implemented in a variety of disciplines and at different school levels. One of 
the widely implemented reforms in the instructional process is project-based 
education. In alignment with this instructional approach and in order to 
integrate instruction, learning and assessment (ILA), many schools 
developed project-based assessments. Finally, the emphasis on problem 
solving and on the use of a variety of authentic problems in order to 
stimulate transfer of knowledge and skills has led to the development of
case-based assessment instruments such as the OverAll Test. 

The chapters by Topping, Boekaerts and Minnaert, Davies & Le Mahieu,
Dori and Segers will elaborate on these new modes of assessment. They will
present and discuss research studies investigating different quality aspects. 

Topping addresses the issues of validity in scoring and consequential
validity of self- and peer assessment. He concludes that self-assessment
scores are less valid and more variable than scores given by professional
teachers. There is more substantial hard evidence that peer
assessment can result in improvements in the effectiveness and quality of 
learning, which is at least as good as gains from teacher assessment, 
especially in relation to writing. In other areas, the evidence is softer. Of 
course, self- and peer assessment are not dichotomous alternatives - one can
lead to and inform the other. Both can offer valuable triangulation in the 
assessment process and both can have measurable formative effects on 
learning, given good quality implementation. Both need training and 
practice, arguably on neutral products or performances, before full 
implementation, which should feature monitoring and moderation. The 
chapter by Boekaerts and Minnaert adds an alternative perspective to the
study of the qualities of new modes of assessment. Research (Boekaerts, 
2002) indicates that motivation factors are powerfully present in any form of 
assessment and bias the students' judgment of their own or somebody else's 
performance. In a recent study on the impact of affect on self-assessment, 
Boekaerts (2002, in press) showed that students' appraisal of the
demand-capacity ratio of a mathematics task, before starting on the task,
contributed a large proportion of the variance explained in their
self-assessment at task
completion. Interestingly, the students’ affect (experienced positive and 
negative emotions during the math task) mediated this effect. Students who 
experienced intense negative emotions during the task underrated their 
performance while students who experienced positive emotions, even in 
addition to negative emotions, overrated their performance. This finding is in 
line with much research in mainstream psychology that has demonstrated the 
effect of positive and negative mood state on performance. In light of these
results, it is surprising that the literature on assessment and on the qualities
of new modes of assessment mainly focuses on the assessment of students’
performances at the product as well as the process level. It does not take into
account the assessment of potential intervening factors such as students’
interest. The chapter by Boekaerts and Minnaert offers insight into the relation
between three basic psychological needs and interest during the different 
phases of a cooperative learning process. For educators, the research results 
are informative for enhancing the accuracy of the diagnosis of students’ 
performances and their self-assessment. Additionally, the authors present an 
interesting method for self-assessment of students’ interest and the
underlying psychological needs. For researchers in the field of assessment,
the research results presented indicate the importance of measuring students’
interest and its underlying psychological needs in order to interpret the
validity of students’ self-assessments more accurately.

Based on a series of research studies conducted in science education,
Dori presents a framework for project-based assessment. The projects are
interpreted as an ongoing process integrating instruction, learning and
assessment. Project-based assessment is considered suited to
fostering and evaluating higher-order thinking skills. In the three studies
presented, the assessment comprises several assessment instruments,
including portfolio assessment, community expert assessment by
observation, self-assessment, and a knowledge test. The findings of the three
studies indicate that project-based assessment indeed fosters higher-order
thinking skills to a greater degree than traditional learning environments
do.

Davies & Le Mahieu present examples of studies about various quality
issues of portfolio assessment. Studies indicate that portfolios impact 
positively on learning in terms of increased student motivation, ownership, 
and responsibility. Researchers studying portfolios found that when students 
choose work samples, the result is a deeper understanding of content, a
clearer focus and a better understanding of what makes a quality product.
Portfolio
construction involves skills such as awareness of audience, awareness of 
personal learning needs, understanding of criteria of quality and the manner 
in which quality is revealed in their work and compilations of it as well as 
development of skills necessary to complete a task. Besides the effect of 
portfolio-assessment on learning as an aspect of the consequential validity, 
Davies and Le Mahieu explore the effect on instruction in a broader context.
There is evidence that portfolios enrich conversation about learning and 
teaching with students as well as the parents involved. Portfolios designed to 
support schools, districts, and cross-district learning (such as provincial or 
state-level assessment) reflect more fully the kinds of learning being asked 
of students in today’s schools and support teachers and other educators in 
learning more about what learning can look like over time. Looking at 
empirical evidence for the validity in scoring, inter-rater reliability of 
portfolio work samples continues to be a concern. The evaluation and 
classification of results is not simply a matter of right and wrong answers, 
but of inter-rater reliability, of levels of skill and ability in a myriad of areas 
as evidenced by text quality and scored by different people, a difficult task at 
best. Clear criteria and anchor papers assist the process. Experience seems to 
improve inter-rater reliability. 
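
To make the notion of inter-rater reliability concrete, the following minimal sketch computes Cohen's kappa, one common chance-corrected agreement index, for two raters scoring the same portfolio work samples. It is an illustration only and is not taken from the studies discussed here: the function name `cohen_kappa`, the four-point scale and the scores are all hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items on a categorical scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own score distribution.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical scores (scale 1-4) given by two raters to eight work samples.
scores_rater_1 = [3, 2, 4, 4, 1, 3, 2, 4]
scores_rater_2 = [3, 2, 3, 4, 1, 3, 3, 4]
kappa = cohen_kappa(scores_rater_1, scores_rater_2)
print(f"kappa = {kappa:.2f}")
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than chance. Clear criteria and anchor papers would be expected to move the index upward.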

The OverAll Test can be seen as an example of case-based assessment,
widely used in curricula where problem solving is one of the core goals of
learning and instruction. The research study presented in the chapter by
Segers explores various validity aspects of the OverAll Test. Although there
is evidence for alignment of the OverAll Test with the curriculum and the
findings indicate criterion validity, the results of the study of the
consequential validity indicate some concern. A survey and semi-structured 
interviews with students and staff indicate intended as well as unintended 
effects of the OverAll Test on the way students learn, on the perceptions of
students and teachers of the goals of education and, in particular, assessment. 
From these results, it is clear that, more than the objective learning 
environment, the subjective learning environment, as perceived by the 
students, plays an important role in the effect of the OverAll Test on 
students’ learning. There is evidence that in most learning environments, the
subjective learning environment plays a major role in determining the
influence of pre-, post- and true assessment effects on learning. Hence, much
work remains to be done in what we should call “assessment engineering”.



5. STUDENTS’ PERCEPTIONS OF NEW MODES OF 
ASSESSMENT 

As indicated in the chapter by Segers, student learning is subject to a
dynamic and richly complex array of influences which are both direct and
indirect, intentional and unintended (Hounsell, 1997). Entwistle (1991)
found that it is not the factual curriculum, including assessment demands,
that directs student learning, but the students’ perceptions of it. This means
that investigating the reality as experienced by the students can add value
in gaining insight into the effect of assessment on learning.

It is widely acknowledged that new modes of assessment can contribute 
to effective learning (Birenbaum & Dochy, 1996). In order to gain insight 
into the underlying mechanism, it seems to be worthwhile to investigate 
students’ perceptions. 

This leads to the question: how do students perceive new modes of
assessment, and what are the influences on their learning? The review study
by Struyven, Dochy & Janssen presented in this book shows that
students’ perceptions of assessment and its properties have considerable 
influences on students’ approaches to learning and, more generally, on 
student learning. Vice versa, students’ approaches to learning influence the 
way in which students perceive assessment. Research studies on the relation 
between perceptions of assessment and student learning report on a variety 
of assessment formats such as self- and peer-assessment, portfolio 
assessment and OverAll Assessment. Aspects of learning taken into
consideration are for example the perceived level of test anxiety, the 
perceived effect on self-reflection and on engagement in the learning 
process, the perceived effect on structuring the learning process, and the 
perceived effect on deep-level learning. The integration of the assessment in 
the learning and instruction process seems to play a mediating role in the
relation between perceptions of assessment and effects on learning. 
Furthermore, it was found that students hold strong views about different 
formats and methods of assessment. For example, within conventional 
assessment, multiple-choice format exams are seen as favourable assessment
methods. But when conventional assessment and alternative assessment 
methods are discussed and compared, students perceive alternative 
assessment as being more “fair” than traditional assessment methods. From 
students’ point of view, assessment has a positive effect on their learning and 
is fair when it (Sambell, McDowell, & Brown, 1997): 

• Relates to authentic tasks. 

• Represents reasonable demands. 

• Encourages students to apply knowledge to realistic contexts. 

• Emphasizes the need to develop a range of skills. 

• Is perceived to have long-term benefits. 



6. SETTING STANDARDS IN CRITERION-REFERENCED
PERFORMANCE ASSESSMENTS

Many of the new modes of assessment, including so-called “authentic 
assessments”, address complex behaviours and performances that go beyond 
the usual multiple-choice tests. This is not to say that objective testing 
methods cannot be used for the assessment of these complex abilities and 
skills, but constructed-response methods often present a practical
alternative. Setting of defensible, valid standards becomes even more 
relevant for the family of constructed response assessments, which include 
extended-response instruments. 

Several methods for carrying out standard setting on extended-response
examinations have been used. In order also to deal with the multidimensional
scales that can be found in extended-response examinations, the Optimized
Extended-Response Standard Setting method (OER) was developed
(Schmitt, 1999). The OER standard setting method uses well-defined rating 
scales to determine the different scoring points where judges will estimate 
minimum passing points for each scale. Recent conceptualisations, such as 
those differentiating between criterion- and construct-referenced assessments 
(William, 1997), present very interesting distinctions between the 
descriptions of levels and the domains. The method described in the chapter
by Cascallar & Cascallar can integrate this conceptualisation, providing
both an adequate “description” of the levels, as attained by the “informed
consensus” of the judges (Schmitt, 1999), and a flexible
“exemplification” of the level inherent in the process to reach the consensus.
As has been pointed out, there is an essential need to estimate the
procedural validity of judgment-based cut-off scores. The OER Standard 
Setting Method suggests a methodology and provides the procedures to 
maintain the necessary degree of consistency to make critical decisions that 
affect examinees in the different settings in which their performance is 
measured against the cut scores set using standard setting procedures. With 
reliability being a necessary but not sufficient condition for validity, it is 
necessary to investigate and establish valid methods for the setting of those 
cut-off points (Plake & Impara, 1996). The general uneasiness with the
current standard setting methods (Pellegrino, Jones, & Mitchell, 1999) rests 
largely on the fact that setting standards is a judgment process that needs 
well-defined procedures, well-prepared judges, and the corresponding 
validity evidence. This validity evidence is essential to reach the quality 
commensurate with the importance of its application in many settings 
(Hambleton, 2001). Ultimately, the setting of standards is a question of 
values and of the decision-making involved in the evaluation of the relative 
weight of the two types of errors of classification. This chapter addresses 
these issues and describes a methodology to attain the necessary level of 
quality in this type of criterion-referenced assessment.
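
The point that standard setting ultimately weighs two types of classification error can be illustrated with a small sketch. It is not the OER method, and every name and number in it is a hypothetical assumption: it simply counts false passes (non-competent examinees at or above the cut score) and false fails (competent examinees below it) for a few candidate cut scores, and shows how the preferred cut score shifts with the relative weight given to each error.

```python
def classification_errors(scores, is_competent, cut_score):
    """Return (false_passes, false_fails) for a given cut score."""
    pairs = list(zip(scores, is_competent))
    # False pass: examinee reaches the cut score but is not competent.
    false_passes = sum(1 for s, c in pairs if s >= cut_score and not c)
    # False fail: examinee is competent but falls below the cut score.
    false_fails = sum(1 for s, c in pairs if s < cut_score and c)
    return false_passes, false_fails


def weighted_loss(scores, is_competent, cut_score, w_false_pass=1.0, w_false_fail=1.0):
    """Total cost of misclassification under the chosen error weights."""
    fp, ff = classification_errors(scores, is_competent, cut_score)
    return w_false_pass * fp + w_false_fail * ff


# Hypothetical examinee scores and an assumed external judgment of competence.
scores = [52, 58, 61, 64, 67, 70, 73, 78]
is_competent = [False, False, True, False, True, True, True, True]
candidate_cuts = [60, 65, 70]

# If a false pass is judged three times as costly as a false fail ...
strict = min(candidate_cuts,
             key=lambda cut: weighted_loss(scores, is_competent, cut, w_false_pass=3.0))
# ... versus a false fail judged three times as costly as a false pass.
lenient = min(candidate_cuts,
              key=lambda cut: weighted_loss(scores, is_competent, cut, w_false_fail=3.0))
print(strict, lenient)
```

With the same data, the two weightings select different cut scores, which is why the choice of weights is a question of values rather than of measurement alone.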



7. THE FUTURE OF NEW MODES OF
ASSESSMENT IN AN ICT WORLD

It is likely that the widespread implementation of ICT in the world of
education will leave its mark on instruction, learning and assessment.
Braun, in his chapter, presents a framework for analysing the forces, such as
technology, that shape the practice of assessment. It comprises three
dimensions: context, purpose and assets. The direct effect of technology
mainly concerns the assets dimension, influencing the whole process of 
assessment design and implementation. Technology increases the efficiency
and effectiveness of identifying the set of feasible assessment designs,
constructing and generating assessment tasks, delivering authentic
assessment tasks in a flexible way with various levels of interactivity, and
automatically scoring students’ constructed responses to the assessment tasks.
In this respect, technology can enhance the validity of new modes of 
assessment, also in large-scale assessment contexts. Indirectly, technology 
has an effect on the development of many disciplines, for example cognitive
science, gradually influencing the constructs and models that shape the
design of assessments for learning. Additionally, there is a complex interplay
between technology on the one hand and political, economic and
market forces on the other.



8. ASSESSMENT ENGINEERING 

It is obvious that the science of assessment in its current meaning, 
referring to new modes of assessment, assessment for learning, assessment 
of competence, is still in an early phase. Certainly, there is a long way to go, 
but research results point in the same conclusive direction: the effects of 
assessment modalities, assessment formats, and the influence of subjective 
factors in assessment environments are not to be underestimated. Surely, 
recent scientists have given clear arguments for further developments in this 
direction and for avoiding earlier pitfalls such as concluding that 
assessments within learning environments are largely comparable with 
assessment of human intelligence and other psychological phenomena. Also 
within the area of edumetrics, a lot of research is needed in order to establish 
a well-defined but evolving framework, and the corresponding instruments 
within a sound quality assurance policy. The science of assessment 
engineering, trying to fill the gaps we find in aligning learning and 
assessment, requires more research within many different fields. The editors 
of this book hope that this contribution will be another step forward in the 
field. Nexon, my horse, come here; I will take another bottle of Chateau La 
Croix, and we will ride further in the wind. 

1. INTRODUCTION

Instruction, learning and assessment are inextricably related and their 
alignment has always been crucial for achieving the goals of education 
(Biggs, 1999). This chapter examines the relationships among the three 
components - instruction, learning, and assessment - in view of the challenge
that education is facing at the dawn of the 21st century. It also attempts to 
identify major intersections where research is needed to better understand 
the potential of assessment for improving learning. 

The chapter comprises three parts: the first examines the assumptions 
underlying former and current perspectives on instruction, learning, and 
assessment (ILA). The second part describes a learning environment - an 
ILA culture - that illustrates the current perspective. The last part suggests 
new directions for research on assessment that is embedded in such a culture. 

Although the chapter is confined to instruction, learning, and assessment 
in higher education, much of what is discussed is relevant and applicable to 
all levels of education. The term assessment (without indication of a specific 
type) is used throughout the chapter to denote formative assessment, also 
known as classroom assessment or assessment for learning. 

2. THEORETICAL FRAMEWORK



2.1 The Challenge 

Characteristic of life at the dawn of the 21st century are rapid changes -
political, economic, social, aesthetic and ethical - as well as rapid
developments in science and technology. Communication has become the
infrastructure of the post-industrial society, and advances in this area,
especially in ICT (information and communication technology), have changed
the scale of human activities. Moreover, new conceptions of time and space
have transcended geographical boundaries, thereby accelerating the
globalisation process (Bell, 1999). This era, known too as the “knowledge
age”, is also characterized by the rapidly increasing amount of human 
knowledge, which is expected to go on growing at an even faster pace. 
Likewise, due to the advances in ICT the volume of easily accessed 
information is rapidly increasing. Consequently, making quick adjustments 
and being a life-long learner (LLL) are becoming essential capabilities, now
more than ever, for effective functioning in various areas of life (Glatthorn & 
Jailall, 2000; Jarvis, Holford, & Griffin, 1998; Pintrich, 2000). 

In order to become a life-long learner, one has to be able to regulate one’s
learning. There are many definitions of self-regulated learning but it is 
commonly agreed that the notion refers to the degree that students are 
metacognitively, motivationally and behaviourally active in their learning 
(Zimmerman, 1989). The cognitive, metacognitive and resource 
management strategies that self-regulated learners activate in combination 
with related motivational beliefs help them accomplish their academic goals 
and overcome obstacles along the way (Pintrich, 2000; Randi & Corno, 
2000; Schunk & Zimmerman, 1994). The need to continue learning 
throughout life, together with the increasing availability of technological 
means for participating in complex networks of information, resources, and 
instruction, greatly benefits self-regulated learners. They can assume more 
responsibility for their learning by deciding what they need to learn and how 
they would like to learn it. 

Bell (1999) notes that “the post industrial society deals with fundamental 
changes in the techno-economic sphere and has its greatest impact in the 
areas of education and work and occupations that are the centres of this 
sphere” (p. lxxxiii). Indeed, a brief look at the employment ads in the 
weekend newspapers is enough to give an idea of the rapid changes that are 
taking place in the professional workplace. These changes mark the 
challenge higher education institutes face, having to prepare their students to 
become professional experts in the new workplace. Such experts are 
required to create, apply and disseminate knowledge and continuously 
construct and reconstruct their expertise in a process of life-long learning. 
They are also required to work in teams and to cooperate with experts in 
various fields (Atkins, 1995; Tynjala, 1999). 

To date, however, many higher education institutes do not seem to be 
meeting this challenge. There is a great deal of evidence indicating that 
many university graduates acquire only surface declarative knowledge of 
their discipline rather than deep conceptual understanding, so that they lack 
the capacity to think like experts in their areas of study (Ramsden, 1987). 
Furthermore, it has been noted that traditional degree examinations do not 
test for deep conceptual understanding (Entwistle & Entwistle, 1991). 

But having specified the challenge, the question is how it can be met. The 
next part of this chapter reviews current perspectives of learning, teaching 
and assessment that offer theoretical foundations for creating powerful 
learning environments that afford opportunities for promoting the required 
expertise. 

2.2 Paradigm Change 

It is commonly acknowledged that in order to meet the goals of education 
an alignment or a high consistency between instruction, learning and 
assessment (ILA) is required (Biggs, 1999). Such an alignment was achieved 
in the industrial era. The primary goal of public education at that time was to 
prepare members of social classes that had previously been deprived of 
formal education for efficient functioning as skilled workers at the assembly 
line. Public education therefore stressed the acquisition of basic skills while 
higher order thinking and intellectual pursuits were reserved for the elite 
ruling class. The ILA practice of public education in the industrial era can be 
summarized as follows: instruction-wise: knowledge transmission; 
learning-wise: rote memorization; and assessment-wise: standardized testing. The 
teacher in the public school was perceived as the “sage on the stage” who 
treated the students as “empty vessels” to be filled with knowledge. Freire 
(1972) introduced the “banking” metaphor to describe this educational 
approach, where the students are depositories and the teacher - the depositor. 
Learning that fits with this kind of instruction is carried out through tedious 
drill and practice, rehearsals and repetitions of what was taught in class or in 
the textbook. The aligned assessment approach is a quantitative one aimed at 
differentiating among students and ranking them according to their 
achievement. This is done by utilizing standardized tests comprising 
decontextualized, psychometrically designed items of the choice-response 
format that have a single correct answer and test mainly low-level cognitive 
skills. As to the responsibilities of the parties involved in the assessment 
process, the examinees participate neither in the development of the test 
items nor in the scoring process, which remains a mystery to them. 
Moreover, under this testing culture, instruction and assessment are 
considered separate activities, the former being the responsibility of the 
teacher and the latter the responsibility of the measurement expert. 

Theoreticians focusing on theories of mind argue that educational 
practices are premised on a set of beliefs about learners’ minds, which they 
term “folk psychology” (Olson & Bruner, 1997). They term the processes 
required to advance knowledge and understanding in learners “folk 
pedagogy”. Olson and Bruner (1997) claim that teachers’ folk pedagogy 
reflects their folk psychology. They distinguish four models of learners’ 
minds and link them to models of learning and teaching. One of these 
models conceptualises the learner as a knower. The folk psychology in this 
case conceives the learner’s mind as a tabula rasa equipped with the ability 
to learn. The mind is conceived as passive (i.e., a vessel waiting to be filled) 
and any knowledge deposited into it is seen as cumulative. The 
corresponding folk pedagogy conceives the instructional process as 
managing the learner from the outside (i.e., teaching by telling). 
The resemblance between this conceptualisation and the traditional 
perspective on ILA described above is quite obvious. Olson and Bruner 
classify this model as an externalist theory of mind, meaning that its focus is 
on what the teacher can do to foster learning rather than on what the learners 
can do or intend to do. Also implied is a disregard on the part of the teacher 
for the way the learners see themselves, thus aspiring, so Olson and Bruner 
claim, to the objective, detached view of the scientist. 

Indeed, the traditional perspective on ILA is rooted in the 
empirical-analytical paradigm that dominated western thinking from the mid-18th to 
the mid-20th century. It reflects an empiricist (positivist) epistemological 
stance according to which knowledge is located outside the subject (i.e., 
independent of the knower) and only one reality/truth exists. It is objectively 
observable through the senses and therefore it must be discovered rather than 
created (Cunningham & Fitzgerald, 1996; Guba, 1990). The traditional 
perspective on ILA is also in line with theories of intelligence and learning 
that share the empirical-analytic paradigm. These theories stress the innate 
nature of intelligence and measure it as a fixed entity that is normally 
distributed in the population. The corresponding theories of learning are the 
behaviourist and associationist (connectionist) ones. 

As was mentioned above, the goals of education in the knowledge age 
have changed, therefore requiring a new perspective for ILA. Indeed such a 
perspective is already emerging. It is rooted in the interpretative or 
constructivist paradigm and reflects the poststructuralist and postmodernist 
epistemologies that have dominated much of the discourse on knowledge in 
the western world in the past three decades (Cunningham & Fitzgerald, 
1996). According to these epistemologies, there are as many realities as there 
are knowers. If truth is possible it is relative (i.e., true for a particular 
culture). Knowledge is social and cultural and does not exist outside the 
individuals and communities who know it. Consequently, all knowledge is 
considered to be created/constructed rather than discovered. 

The new perspective on ILA is also rooted in new theories of human 
intelligence that stress the multidimensional nature of this construct 
(Gardner, 1983, 1993; Sternberg, 1985) and the fact that it should not be 
treated as a fixed entity. There is evidence that training interventions can 
substantially raise the individual’s level of intellectual functioning 
(Feuerstein, 1980; Sternberg, 1986). According to these theories, intelligence 
is seen as mental self-management (Sternberg, 1986) implying that one can 
learn how to learn. Furthermore, mental processes are believed to be 
dependent upon the social and cultural context in which they occur and to be 
shaped as the learner interacts with the environment. 

What then is the nature of the emerging perspective on ILA? In order to 
answer this question we first briefly review the assumptions underlying 
current perspectives on learning and the principles derived from them. 

2.2.1 Current Perspectives on Learning 

Constructivism is the umbrella under which learning perspectives that 
focus on mind-world relations are commonly grouped. These include 
modern (individual) and post-modern (social) learning theories (Prawat, 
1996). The former, also referred to as cognitive approaches, focus on the 
structures of knowledge in learners’ minds and include, among others, the 
cognitive-schema theory (Derry, 1996), Piaget-based radical constructivism 
(Von Glasersfeld, 1995) and the constructivist revision of information 
processing theory (Mayer, 1996). Post-modern social constructivist theories, 
on the other hand, reject the notion that the locus of knowledge is in the 
individual. The approaches referred to as situative emphasize the distributed 
nature of cognition and focus on students’ participation in socially organized 
learning activities (Brown, Collins, & Duguid, 1989; Lave & Wenger, 1991). 
Social constructivist theories include, among others, socio-cultural 
constructivism in the Vygotskian tradition (Vygotsky, 1978), symbolic 
interactionism (Blumer, 1969; Cobb & Yackel, 1996), Deweyan idea-based 
social constructivism (Dewey, 1925/1981) and the social psychological 
constructivist approach (Gergen, 1994). 

Common to all these perspectives is the central notion of activity - the 
understanding that knowledge, whether public or individual, is constructed. 
Yet they vary with respect to their assumptions about the nature of 
knowledge (Phillips, 1995) and the way in which activity is framed (Cobb & 
Yackel, 1996). Despite heated disputes among the various camps or sects, 
the commonalities seem to be growing and some perspectives seem to 
complement each other rather than to clash (Billett, 1996; Cobb & Yackel, 
1996; Ernest, 1999; Fosnot, 1996; Sfard, 1998; Vosniadou, 1996). Recently, 
proponents of the cognitive and situative perspectives identified several 
important points on which they judge their perspectives to be in agreement 
(Anderson, Greeno, Reder, & Simon, 2000). Suggesting that the two 
approaches are different routes to the same goal, they declared that both 
perspectives “are fundamentally important in education... they can cast light 
on different aspects of the educational process” (p. 11). A similar conclusion 
was reached by Cobb (1994) who claimed that each of the two perspectives 
“tells half of a good story” (p. 17). Another attempt at reconciling the two 
perspectives, yet from a different stance, was recently made by Packer and 
Goicoechea (2000). They argue that sociocultural and constructivist 
perspectives on learning presume different, and incommensurate, ontological 
assumptions. According to their claim, what socioculturists call learning is 
the process of human change and transformation, whereas what 
constructivists call learning is only part of that larger process. Yet they state 
that “whether one attaches the label ‘learning’ to the part or to the whole, 
acquiring knowledge and expertise always entails participation in 
relationship and community and transformation both of the person and of the 
social world” (p. 237). 

Adhering to the reconciliatory stance, the following is an eclectic set of 
principles of learning and insights, distilled from the various competing 
perspectives, which draws on both schools of thought - the individual and 
the social - though with a slight bias towards the latter. 

Learning as active construction 

• Learning is an active construction of meaning by the learner. (Meaning 
cannot be transmitted by direct instruction.) 

• Discovery is a fundamental component of learning. 

• For learning to occur, the learner has to activate prior knowledge, to 
relate new information/experience to it and restructure it accordingly. 

• Learning is strategic. It involves the employment of cognitive and 
metacognitive strategies. (Self-regulated learners develop an awareness 
of when and how to apply strategies and to use skills; they monitor 
their learning process and evaluate and adjust their strategies accordingly.) 

• Reflection is essential for meaningful learning. 

• Learning is facilitated when the student participates in the learning 
process and has control over its nature and direction. 

Learning as a social phenomenon 

• Learning is fundamentally social and derives from interactions with 
others (mind is distributed in society.) Cognitive change results from 
internalising and mentally transforming what is encountered in such 
interactions. 

Learning as context related 

• Learning is situated in a socio-cultural context. (What one learns is 
socially and culturally determined). Both social and individual 
psychological activity are influenced or mediated by the tools and signs 
in one’s socio-cultural milieu. 

Learning as participation 

• Learning involves a process of enculturation into an established 
community of practice by means of cognitive apprenticeship. 

• “Expertise” in a field of study develops not just by accumulating 
information, but also by adopting the principled and coherent ways of 
thinking, reasoning, and of representing problems shared by the 
members of the relevant community of practice. 

Learning as influenced by motivation, affect and cognitive 
styles/intelligences 

• What is constructed from a learning encounter is also influenced by the 
learner’s motivation and affect: his/her goal orientation, expectations, 
the value s/he attributes to the learning task, and how s/he feels about it. 

• Learning can be approached using different learning styles and various 
profiles of intelligences. 

Learning as labour intensive engagement 

• The learning of complex knowledge and skills requires extended effort 
and guided practice. 

This mix of tenets represents a view that learning is a process of both 
self-organization and enculturation. Both processes take place as the learner 
participates in the culture and in doing so interacts with other participants. 
This view includes both the metaphor of “acquisition” and the metaphor of 
“participation” put forward by Sfard (1998) for expressing prevalent 
frameworks of learning, which, despite and because of their clashing 
definitions of learning, are both needed to explain that complex 
phenomenon. 

2.2.2 Current Perspectives on Instruction 

Contemporary definitions of good teaching emphasize its central function 
in facilitating students’ learning. For instance, Biggs (1999) defines good 
teaching as “getting most students to use the high cognitive level processes 
that more academic students use spontaneously” (p. 73). Fenstermacher 
(1986) states that “the central task of teaching is to enable the student to 
perform the tasks of learning” (p. 39). In order to facilitate learning, as it is 
conceptualised in constructivist frameworks, a paradigm shift in instruction, 
from teaching-focused to learning-focused, is essential. Central to this 
paradigm are concepts of autonomy, mutual reciprocity, social interaction 
and empowerment (Fosnot, 1996). 

The role of the teacher under such a paradigm changes from that of an 
authoritative source of knowledge, who transmits this knowledge in 
hierarchically ordered bits and pieces, to that of a mentor or facilitator of 
learning who monitors for deep understanding. Likewise, the role of the 
student changes from being a passive consumer of knowledge to an active 
constructor of meaning. An important feature of the teaching-learning 
process is the dialogue between the teacher and the students through which, 
according to Freire (1972), the two parties “become jointly responsible for 
the process in which all grow” (p. 53). 

Biggs (1999) defines instruction as “a construction site on which students 
build on what they already know” (p. 72). The teacher, being the manager of 
this “construction site”, assumes various responsibilities depending on the 
objectives of instruction and the specific needs that arise in the course of this 
process of construction. These responsibilities include supervising, 
directing, counselling, apprenticing, and participating in a knowledge 
building community. 

The learning environment that leads to conceptual development and 
change is rich in meaning-making and social negotiation. In such an 
environment, students are engaged in activity, reflection and conversation 
(Fosnot, 1996). They are encouraged to ask questions, explore and conduct 
inquiries; they are required to hypothesize; to suggest multiple solutions to 
problems; to generate conceptual connections, metaphors and personal insights; 
and to reflect, justify, articulate ideas, elaborate, explain, clarify, criticize, etc. 
The learning tasks are authentic and challenging, thus stimulating intrinsic 
motivation and fostering student initiative and creativity. Students are 
offered choice and control over the learning tasks, and the classroom ethos is 
marked by a joint responsibility for learning. In this culture, hard work is 
valued and not perceived as a sign of a lack of ability. 

An instructional approach that incorporates such features is problem- 
based learning (PBL). Briefly stated, PBL is a total approach to learning in 
which knowledge is acquired in a working context and is put back to use in 
that context. The starting point for learning is an authentic problem posed to 
the student, who needs to seek discipline-specific knowledge in order to 
solve it. The problems thus define what is to be learnt. Biggs (1999) argues: 
“PBL is alignment itself. The objectives stipulate the problems to be solved, 
the main TLA [teaching-learning activity] is solving them, and the 
assessment is seeing how well they have been solved” (p. 207). He 
distinguishes five goals of PBL: (a) structuring functional knowledge; (b) 
developing effective professional reasoning processes; (c) developing 
self-regulated learning skills; (d) developing collaborative skills; and (e) 
increasing motivation for learning. Instead of content coverage, students in 
PBL settings learn the skills for seeking out the required knowledge when 
needed. They are required to base decisions on knowledge, to hypothesize, 
to justify, to evaluate and to reformulate - all of which are the kind of 
cognitive activity that is required in current professional practice. 

Emerging technologies of computer supported collaborative learning 
(CSCL) provide increasing opportunities for fostering learning in such an 
environment by creating on-line communities of learners. Computer 
mediated communication (CMC) is one such technology which enables 
electronic conferencing (Harasim, 1989). It offers a dynamic collaborative 
environment in which learners can interact, engage in critical thinking, share 
ideas, defend and challenge each other’s assumptions, reflect on the learning 
material, ask questions, articulate their views, test their interpretations and 
syntheses, and revise and reconstruct their ideas. By fostering 
intersubjectivity among learners, this technology can thus help them 
negotiate meaning, perceive multiple problem-solving perspectives and 
construct new knowledge (Bonk & King, 1998; McLoughlin & Luca, 2000). 

It is well acknowledged that successful implementation of such pedagogy 
entails, for many teachers, a radical change in their beliefs about knowledge, 
knowing and the nature of intelligence as well as in their conceptions of 
learning and teaching. 

From a theory-of-mind perspective, Olson and Bruner (1997) present two 
models of learners’ minds that bear close resemblance to the perspective of 
this constructivist-based pedagogy. One of these models conceptualises the 
learner as a thinker and the other as an expert. The folk psychology 
regarding the former conceives learners as being able to understand, to 
reason, to reflect on their ideas, to evaluate them, and correct them when 
needed. Learners, it is claimed, have a point of view, hold more or less 
coherent “theories” about the world and the mind, and can turn beliefs into 
hypotheses to be openly tested. The learner is conceived of as an interpreter 
who is engaged in constructing a model of the world. The corresponding folk 
pedagogy conceives the teacher’s role as that of a collaborator who tries to 
understand what the learners think and how they got there. The learning 
process features a dialogue - an exchange of understanding between teacher 
and learner. The folk psychology regarding the other model - the learner as 
an expert - conceives of the learner’s mind as an active processor of beliefs 
and theories that are formed and revised based on evidence. The learner, so it 
is claimed, recognizes the distinction between personal and cultural 
knowledge. Learning is of the peer collaboration type whereby the learner 
assumes the role of knowledge co-constructor. The corresponding folk 
pedagogy conceives the teacher’s role as an information manager who 
assists the learners in evaluating their beliefs and theories reflectively and 
collaboratively in light of evidence and cultural knowledge. Olson and 
Bruner classify these two models as internalist theories of mind, stating that 
unlike the externalist theories that focus on what the teacher can do to foster 
learning, internalist theories focus “on what the learners can do or what they 
think they are doing, and how learning can be premised on those intentional 
states” (p. 25). They further argue that internalist theories aspire to apply the 
same theories to learners as learners apply to themselves, as opposed to the 
objective, detached view espoused by externalist theories. 

2.2.3 Current Perspective on Assessment 

The assessment approach that is aligned with the constructivist-based 
teaching approach is sometimes referred to as assessment culture, as 
opposed to the conservative testing culture (Birenbaum, 1996; Gipps, 1994; 
Wolf, Bixby, Glenn, & Gardner, 1991). While the conservative approach 
reflected a psychometric-quantitative paradigm, the constructivist approach 
reflects a contextual-qualitative paradigm. This approach strongly 
emphasizes the integration of assessment and instruction and focuses on the 
assessment of the process of learning in addition to that of its products. The 
assessment itself takes many forms, all of which are generally referred to by 
psychometricians as “unstandardized assessments embedded in instruction” 
(Koretz, Stecher, Klein, & McCaffrey, 1994). Reporting practices shift from 
single total scores, used in the testing culture for ranking students, to 
descriptive profiles that provide multidimensional feedback for fostering 
learning. In this culture the position of the student with regard to the 
assessment process changes from that of a passive, powerless, often 
oppressed, subject who is mystified by the process, to being an active 
participant who shares responsibility in the process. Students participate in 
the development of the criteria and the standards for evaluating their own 
performance; they practice self- and peer-assessment and are required to 
reflect on their learning and to keep track of their academic growth. 

These features of the assessment culture make it most suitable for 
formative classroom assessment, which is geared to promote learning, as 
opposed to summative high-stakes (often large-scale) assessment that serves 
accountability as well as certification and selection purposes (Koretz et al., 
1994; Worthen, 1993). 

Feedback has always been at the heart of formative assessment. 
Metaphorically speaking, if we liken the alignment of instruction, learning and 
assessment to a spinning top, then feedback is the force that spins the top. 
Feedback as a general term has been defined as “information about the gap 
between the actual level and the reference level of a system parameter which 
is used to alter the gap in some way” (Ramaprasad, 1983, p. 3). The 
implications for assessment are that in order for it to be formative, the 
information contained in feedback must be of high quality and mindfully 
used. This entails that the learner first realizes the gap between the desired 
goal (the reference) and his/her current level of understanding and identifies 
the causes of this gap, and then acts to close it (Black & Wiliam, 1998; 
Ramaprasad, 1983; Sadler, 1989). Teachers and computer-based 
instructional systems were the providers of feedback in the past. The new 
perspective on assessment stresses the active role of the learner in generating 
feedback. Self- and peer-assessment are therefore highly recommended for 
advancing understanding and promoting self-regulated lifelong learners 
(Black & Wiliam, 1998). 
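
The definition of feedback quoted above can be made concrete with a small sketch. It is an illustration only, not a model proposed in the sources cited: the 0-100 scale, the names `GapInfo`, `act_on_feedback` and `formative_cycles`, and the `effort` parameter (the fraction of the gap the learner closes in one cycle) are all assumptions introduced here. The point it shows is that information about the gap counts as feedback only when it is used to alter the gap.

```python
from dataclasses import dataclass


@dataclass
class GapInfo:
    """Information about the gap between a reference level and the actual level."""
    reference: float  # desired goal (hypothetical 0-100 scale)
    actual: float     # learner's current level on the same scale

    @property
    def gap(self) -> float:
        return self.reference - self.actual


def act_on_feedback(info: GapInfo, effort: float) -> float:
    """Return the new actual level after the learner acts on the gap information.

    `effort` is the assumed fraction of the gap closed in one cycle (0.0 to 1.0).
    With effort 0.0 the information is not used, the gap is unaltered, and under
    the definition quoted above it does not count as feedback.
    """
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be between 0.0 and 1.0")
    return info.actual + effort * info.gap


def formative_cycles(reference: float, start: float, effort: float, cycles: int) -> list[float]:
    """Trace the learner's level over repeated assess-and-act cycles."""
    levels = [start]
    for _ in range(cycles):
        info = GapInfo(reference=reference, actual=levels[-1])
        levels.append(act_on_feedback(info, effort))
    return levels
```

With a reference level of 80, a starting level of 40 and an effort of 0.5, three cycles give the levels 40, 60, 70 and 75; with an effort of 0 the level stays at 40, however accurate the gap information is.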

New forms of assessment 

The main tools employed in the assessment culture for collecting 
evidence about learning are performance tasks, learning journals and 
portfolios. These tools are well known to readers and are therefore presented 
only briefly, for the sake of completeness. Unlike most 
multiple-choice items, performance tasks are designed to tap higher order 
thinking such as planning, hypothesizing, organizing, integrating, criticizing, 
drawing conclusions, evaluating, etc.; they are meant to elicit what Perkins 
and Blythe (1994) term “understanding performances”. Typically, 
performance on such tasks is not subject to time limitations, and a variety of 
tools, including those used in real life for performing similar tasks, are 
permitted. The tasks are complex, often refer to multidisciplinary content, 
have more than one possible solution or solution path, and are 
loosely structured, which requires the student to identify and clearly state the 
problem. The tasks, typically involving investigations of various types, are 
meaningful and authentic to the practice in the discipline, and they aim to be 
interesting, challenging and engaging for the students, who often perform 
them in teams. Upon completion of the task, students are frequently required 
to exhibit their understanding in a communicative manner. Analytic or 
holistic rubrics that specify clear benchmarks of performance at various 
levels of proficiency serve the dual purpose of guiding the students as they 
perform the task as well as guiding the raters who evaluate the performance. 
They are also used for self- and peer-assessment. 
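
As a rough illustration of how an analytic rubric supports the shift from a single total score to a descriptive profile, consider the sketch below. Everything in it is hypothetical: the two criteria, the three proficiency levels, the benchmark wording and the function names are invented for the example and are not taken from the course or the sources cited.

```python
# Hypothetical analytic rubric: each criterion maps proficiency levels (1-3)
# to benchmark descriptors. Criteria and wording are invented for illustration.
RUBRIC = {
    "problem statement": {
        1: "problem is not identified",
        2: "problem is identified but vaguely stated",
        3: "problem is identified and clearly stated",
    },
    "justification": {
        1: "conclusions are asserted without support",
        2: "some conclusions are supported by evidence",
        3: "all conclusions are supported by evidence",
    },
}


def descriptive_profile(ratings: dict[str, int]) -> dict[str, str]:
    """Map per-criterion ratings to their benchmark descriptors."""
    missing = set(RUBRIC) - set(ratings)
    if missing:
        raise ValueError(f"no rating for: {sorted(missing)}")
    return {criterion: RUBRIC[criterion][ratings[criterion]] for criterion in RUBRIC}


def total_score(ratings: dict[str, int]) -> int:
    """Collapse the same ratings into the single total used for ranking."""
    return sum(ratings[criterion] for criterion in RUBRIC)
```

Two students rated (3, 1) and (1, 3) on these criteria receive the same total of 4, yet their profiles point to different next steps, which is the information a total score discards.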

Learning journals are used for documenting students’ reflections on the 
material and their learning processes. The learning journal thus both promotes the 
construction of meaning and contributes valuable evidence for 
assessment purposes. Learning journals can be used to assess the quality of 
knowledge (Sarig, 1996) and the learner’s reflective and metacognitive 
competencies (Birenbaum & Amdur, 1999). 

The portfolio best serves the dual purpose of learning and assessment. It 
is a container that holds a purposeful collection of evidence of the student’s 
learning efforts, process, progress, and accomplishments in (a) given area(s). 
When implementing portfolios it is essential that the students participate in 
the selection of its content, of the guidelines as well as of the criteria for 
assessment (Arter & Spandel, 1992; Birenbaum, 1996). When integrated, 
evidence collected by means of these tools can provide a comprehensive and 
realistic picture of what the student knows and is able to do in a given area. 

2.2.4 Relationships between Assessment and Learning 

Formative assessment is expected to improve learning. It can meet this 
expectation and indeed was found to do so (Black & Wiliam, 1998) but this 
is not a simple process and occasionally it fails. What are the factors that 
might interfere with the process and cause its failure? In order to answer this 
question let’s examine the stages a learner proceeds through from the 
moment s/he is faced with an assessment task until s/he reaches a decision as 
to how to respond to the feedback information. Below are listed the stages 
and what the learner needs to possess/know/do with respect to each of them. 

Stage I: Getting the task 

• Interpret the task in congruence with the teacher’s intended goals. 

• Understand the task requirements and the standard that it is 
addressing. 

• Have a clear idea of what an outcome that meets the standard looks 
like. 

• Know what strategies should be applied in order to successfully 
perform the task. 

• Perceive the value of the task and be motivated to perform it to the 
best of his/her capability. 

• Be confident in his/her capability to perform the task successfully 
(self-efficacy). 

Stage II: Performing the task 

• Effectively apply the relevant cognitive strategies. 

• Effectively apply metacognitive strategies to monitor and regulate 
performance. 

• Effectively manage time and other relevant resources. 

• Effectively control and regulate his/her feelings. 

• If given a rubric, appropriately interpret its benchmarks. 

• Be determined to invest the necessary efforts to complete the task 
properly. 

Stage III: Appraising performance and generating feedback information 

• Accurately assess (by him/herself or with the help of the teacher 
and/or peers) his/her performance. 

• In case of a gap between the actual performance and the standard - 
understand the goals he/she is failing to attain. 

• Understand what caused the failure. 

• Conceive of mistakes as a springboard toward growth rather than just 
a sign of low ability; consequently, not attaining the goals does not 
affect his/her self-image and self-efficacy. 

• Possess a mastery orientation towards learning. 

• Feel committed to closing the gap. 

• State self-referenced goals for pursuing further learning in order to 
close the gap. 

Learners vary in their profile with respect to the features listed above, 
and consequently formative assessment occasionally fails to improve 
learning. Yet, classroom ethos and other features of the learning 
environment, as well as teachers’ and students’ beliefs about knowledge, 
learning and teaching can reduce this variance thus affecting the rate of 
success of formative assessment (Birenbaum, 2000; Black & Wiliam, 1998). 

To conclude the theoretical framework, here are the main attributes of an 
aligned ILA system that is suitable for achieving the goals of higher 
education in the knowledge age: instruction-wise: learning-focused; 
learning-wise: reflective-active knowledge construction; assessment-wise: 
contextualized, interpretative and performance-based. 
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3. FROM THEORY TO PRACTICE: AN ILA 
CULTURE 

This part briefly describes an ILA culture based on constructivist 
principles that was created in a graduate course dedicated to alternatives in 
assessment taught by the author. It illustrates the application of methods and 
tools such as those discussed earlier in this chapter and exemplifies that 
instruction, learning and assessment are inextricably bound up with each 
other, making up a whole that is more than the sum of its parts. 

3.1 Aims of the Course 

The ILA culture developed in this two-semester course aims to 
introduce the students (most of whom are in-service educators, such as 
teachers, principals and superintendents, who pursue their master’s or 
doctoral studies) 
to a culture that is conducive to the implementation of alternative 
assessment. It offers the students an opportunity, through personal 
experience, to deepen their understanding regarding the nature of this type of 
assessment and concerning its role as an integral part of the teaching- 
learning process. At the same time, it offers them a chance to develop their 
own reflective and other self-regulated learning capabilities. Such 
capabilities are expected to support their present and future learning 
processes (Schunk & Zimmerman, 1994) as well as their professional 
practice (Schön, 1983). 

3.2 Design Features 

The course design is rooted in constructivist notions about knowledge 
and knowing and the derived conceptions of learning and teaching, and it is 
geared to elicit individual and social knowledge construction through 
dialogue and reflection. It uses a virtual environment that complements the 
regular classroom meetings to create a knowledge building community by 
means of asynchronous electronic discussion forums (e-forums). Included in 
this community are also students who took the course in previous years who 
have chosen to remain active members of the knowledge building 
community. Recently the learning environment has been augmented to offer 
enrichment materials. In its current form it includes short presentations of 
various relevant topics with links to references, video clips, PowerPoint 
presentations, Internet resources, examples of learning outcomes from 
previous years, etc. Information is cross-referenced and can be retrieved 
through various links. The learning environment is accessed through a “city 
map” which is meant to visually orient the learners regarding the scope and 
structure (relationships among concepts) of the assessment domain and its 
intersections with the domains of learning and instruction. Icons that 
resemble public institutions and other functional artefacts of a city help the 
learner navigate while exploring the “ILA City”. 

The features of the culture created in the course include: freedom of 
choice, openness, flexibility, student responsibility, student participation, 
knowledge sharing, responsiveness, support, caring, empowerment, and 
mutual respect. 

3.3 Instruction 

The instruction is learning-focused. Central features of the pedagogy are 
dialogue, reflection and participation. The instructor facilitates students’ 
learning by engaging them in discussions, conversations, collaborative 
inquiry projects, experiments, authentic performance tasks and reflection, as 
well as by modelling good and bad practice illustrated by means of her own 
and other professionals’ behaviour, through video tapes and field trips. 
Feedback and guidance are regularly provided to students as they work on 
their projects. The instructional process is responsive to students’ needs and 
interests. The discussions centre on authentic issues and dilemmas 
concerning assessment but there is no fixed set of topics nor pre-specified 
sequencing. There is no expectation that all students will leave the course with 
the same knowledge base. Rather each student is expected to deepen his/her 
understanding of the aspects that are most relevant to his/her professional 
practice and be able to make educated decisions regarding assessment on the 
basis of the knowledge constructed during the course. 

3.4 Learning 

The learning is reflective-active. Students are engaged in personal- and 
group-knowledge construction by means of group projects, discussions held 
in class and in the e-forums, and learning journals in which they reflect on 
what has been learnt in class and from the assigned reading materials. In the 
e-forums, students share with the other community members reflections they 
have recorded in their learning journals, and they record back in their 
journals the insights gained from the discussion. Students work on their 
projects in teams meeting face-to-face and/or through sub-forums opened for 
each project at the course’s virtual site. Towards the end of the course they 
present their project outcomes in a plenary session and receive written 
feedback from their peers and the instructor. They then use this feedback for 
revising their work and are required to hand in a written response to the 
feedback along with the final version of their work. During the course 
students study the textbook and assigned papers and they retrieve relevant 
materials for their projects from various other resources. In addition, each 
student is nominated as a “web-site reporter”, which entails frequent visits to a 
given internet site dedicated to assessment and reporting back to class when 
relevant information regarding the topic under discussion is retrieved. 

3.5 Assessment 

The assessment is performance based, integrated and contextualized. It 
serves both formative and summative purposes. The following three 
products are assessed providing a variety of evidence regarding the learning 
outcomes: 

• Retrospective summary of learning. At course end, students are required 
to review their on-going journal entries and their postings in the e- 
forums and prepare a retrospective summary. The summary is meant to 
convey their understanding regarding the various aspects of assessment 
addressed in the course and their awareness of their personal growth 
with respect to this domain. The holistic rubric, jointly developed with 
the students, comprises the following criteria: quality of knowledge 
(veritability, complexity, applicability), learning disposition (critical, 
motivated), self-awareness of progress, and contribution to the 
knowledge building community. 

• Performance assessment project. Students are required to develop an 
assessment task, preferably an interdisciplinary one. This involves the 
definition of goals for the assessment, formulation of the task, 
development of a rubric for assessing performance, administration of the 
task, analysis of the results, and generation of feedback as well as critical 
evaluation of the quality of the assessment. 

• “Position paper”. This project is introduced as a collective task in which 
the class writes a position paper to the Ministry of Education regarding 
the constructivist-based assessment culture. (It should be noted that this 
is an authentic task given that interest in the new forms of assessment is 
recent among Israeli policy makers.) Students propose controversial 
issues relevant to the field and each team chooses an issue to study and 
write about. The jointly developed analytic rubric for assessing 
performance on this task consists of the following dimensions: issue 
definition and importance, content, sources of information, argument, 
conclusions and implications, and communicability with emphasis on 
audience awareness. 

Students work on their projects throughout the course. The features of the 
assessment process are as follows: 

• On-going feedback - provided throughout the course by the instructor in 
accordance with each student/group’s particular needs. This feedback 
loop is conducted both through the project e-forum and through face-to- 
face meetings. The other community members have access to the project 
e-forum and are invited to provide feedback or suggestions. 

• Student participation - Students take part in the decision making process 
regarding the assessment. They participate in the development of the 
assessment rubrics and have a say in how their final grades are to be 
weighted. 

• Self- and peer-assessment. The same rubric is used by the instructor and 
by the students. The latter use it to assess self- and peer- performance. 
After the students submit their work for final assessment, including their 
self-assessment, the instructor provides each student with detailed 
feedback and meets the student for an assessment conference if a 
discrepancy between the assessments occurs. 
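
The self- and peer-assessment procedure described above can be illustrated with a minimal sketch. The criterion names follow the position-paper rubric listed earlier; the four-point scale, the equal weights and the discrepancy threshold are assumptions made purely for illustration and are not taken from the course.

```python
# Hypothetical sketch: criteria follow the position-paper rubric above;
# the 1-4 scale, equal weights and threshold are illustrative assumptions.
CRITERIA = [
    "issue definition and importance",
    "content",
    "sources of information",
    "argument",
    "conclusions and implications",
    "communicability",
]


def rubric_score(ratings, weights=None):
    """Return the weighted mean of the ratings over all rubric criteria."""
    if weights is None:
        weights = {c: 1.0 for c in CRITERIA}  # assumed: equal weights
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(ratings[c] * weights[c] for c in CRITERIA) / total_weight


def needs_conference(self_ratings, instructor_ratings, threshold=0.5):
    """Flag a gap between the two overall scores that exceeds the threshold."""
    gap = abs(rubric_score(self_ratings) - rubric_score(instructor_ratings))
    return gap > threshold  # assumed threshold, in rubric-scale points


# Example: the student rates every criterion 4; the instructor rates two lower.
self_ratings = {c: 4 for c in CRITERIA}
instructor_ratings = dict(self_ratings, argument=2, content=2)
print(rubric_score(self_ratings), rubric_score(instructor_ratings))
print(needs_conference(self_ratings, instructor_ratings))
```

Because students have a say in how grades are weighted, the weights are passed in as a parameter rather than fixed; any negotiated weighting could be supplied in place of the equal weights assumed here.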

As to course evaluation, students’ average rating of the course is quite 
high and their written comments indicate that they consider it a very 
demanding yet profound learning experience. The same can be said from 
the standpoint of the instructor. 



4. DIRECTIONS FOR RESEARCH ON 
ASSESSMENT FOR LEARNING 

Research evidence conclusively shows that assessment for learning 
improves learning. Following a thorough literature review about assessment 
and classroom learning, Black and Wiliam (1998) conclude that the gains in 
achievement due to formative assessment “appear to be quite considerable... 
among the largest ever reported for educational interventions” (p. 61). They 
note that these gains were evident where innovations designed to strengthen 
the frequent feedback about learning were implemented. However, Black 
and Wiliam also claim “it is clear that most of the studies in the literature 
have not attended to some of the important aspects of the situations being 
researched” (p.58). Stressing that “the assessment processes are, at heart, 
social processes, taking place in social settings, conducted by, on and for 
social actors” (p.56) they point to the absence of contextual aspects from 
much of the research they reviewed. In other words, the ILA culture in 
which assessment for learning is embedded has yet to be empirically 
investigated. Figure 1 displays some context-related constructs subsumed 
under the ILA culture and their hypothesized interrelationships and impact 
on assessment. Although this mapping by no means captures the entire 
network of relevant constructs and interrelationships it suffices to illustrate 
the intricacy of such a network. Represented in the mapping are constructs 
related to class regime and climate; learning environment; teachers’ and 
learners’ epistemological beliefs, conceptions of learning, teaching and 
assessment; teachers’ knowledge, skills, and strategies; learners’ motivation, 
competencies and strategies; learning interactions and consequent 
knowledge construction; and finally, assessment strategies and techniques 
with special emphasis on feedback. 

As can be seen in the figure the hypothesized relationships among these 
constructs create a complex network of direct, indirect and reciprocal effects. 
The question, then, arises as to how this intricacy can be investigated. 
Judging by the nature of the key constructs a quantitative approach, 
employing even sophisticated multivariate analyses such as SEM (structural 
equation modelling), will not suffice. A qualitative approach seems a better 
choice. Cultures are commonly studied by means of ethnographic methods, 
but even among those, the conventional ones may not be sufficient. 
Eisenhart (2001) has recently criticized the sole reliance on conventional 
ethnographic methods arguing that such methods and ways of thinking about 
and looking at cultures are not enough if we want to grasp the new forms of 
life, including school culture, in the post-modern era. She notes that these 
forms “seem to be faster paced, more diverse, more complicated, more 
entangled than before” (p. 24). Research aimed at understanding the ILA 
culture will therefore need to incorporate a variety of conventional and non- 
conventional ethnographic methods and perspectives that fit the conditions 
and experiences of such culture. 

Figure 1. Mapping of Constructs Related to the ILA Culture 



Further research is also needed regarding the nature of assessment-related 
constructs in light of recent shifts in their conceptualisation. For instance, 
feedback is currently conceptualised as a joint responsibility of the teacher 
and the learner (Black & Wiliam, 1998). Consequently, further research is 
needed to better understand the nature of self- and peer-assessment and their 
impact on learning. Relevant research questions might refer to the process 
whereby the assessment criteria are being negotiated; to how students come 
to internalise the standards for good performance; to the accuracy of self- 
and peer-assessments; to how they affect learners’ self-efficacy and other 
related motivational factors, etc. Another related construct whose underlying 
structure deserves further research is conceptions of assessment. Relevant 
research questions might refer to the nature of teachers’ and students’ 
conceptions of good assessment practice and their respective roles in the 
assessment process; to the effect of the learning environment on students’ 
conceptions of assessment; to the relationships between teachers’ 
conceptions of assessment, their mental models and their interpretations and 
use of evidence of learning, etc. It is obvious that in order to answer such 
questions a variety of quantitative and qualitative methods will have to be 
employed. 

Assessment of learning interactions in a virtual learning environment is 
yet another area in which further research is needed due to the accelerated 
dispersion of distance learning in higher education. Questions such as how 
to present feedback information efficiently during on-line problem solving, 
or how to assess students’ contribution to the knowledge building community 
in an e-discussion forum, are examples of a wide variety of timely, 
practical assessment-related questions that need to be properly addressed. 

Another line of research is related to teacher training in assessment. It is 
well acknowledged that most teachers currently employed have not been 
systematically trained in assessment either in teacher preparation programs 
or in professional development while on the job (Popham, 2001). The 
situation is even more acute in higher education where most instructors do 
not receive systematic pedagogical training of any kind. Since the majority 
of them left school before the new forms of assessment were introduced they have 
never been exposed to this type of assessment. In view of this situation, 
research should be directed at designing effective training interventions 
tailored to the needs of these populations. 

The issues addressed so far relate to formative assessment; however, for 
certain purposes, such as certifying and licensing, there is a need for 
high-stakes summative assessment. Quality control issues then become crucial. 
They refer to the accuracy (reliability) of the assessment scores as well as to 
the validity of the inferences drawn on the basis of these scores. Research 
has shown that the new forms of assessment tend to compare unfavourably 
to standardized testing with respect to these psychometric criteria of 
reliability and validity (Dunbar, Koretz, & Hoover, 1991; Koretz, Stecher, 
Klein, & McCaffrey, 1994; Linn, 1994). Consequently, standardized tests 
are mostly used for high-stakes summative assessment. For practical 
purposes this dichotomy between summative and formative assessment is 
problematic. Further efforts should therefore be made to conceptualise more 
suitable criteria for quality control with respect to the new forms of 
assessment. This direction complies with criticism raised regarding the 
applicability of psychometric models to this type of assessment (Birenbaum, 
1996; Delandshere & Petrosky, 1994; 1998; Dierick & Dochy, 2001; Moss, 
1994, 1996) and in general to the context of classroom assessment (Dochy & 
Moerkerke, 1997). On this line, principles of an edumetric approach have 
recently been suggested (Dierick & Dochy, 2001) that expand the traditional 
concepts of validity and reliability to include assessment criteria that are 
sensitive to the intricacy of the teaching-learning process. The 
operationalization of these criteria and their applicability will need to be 
investigated further, along with the impact of their implementation. 

In conclusion, it seems that the assessment community has come a long 
way since the new forms of assessment were introduced more than a decade 
ago. It has deepened its understanding regarding the role and potential of 
assessment in the instruction-learning process and its context. These 
understandings, together with the new intriguing options provided by ICT 
(information communication technology), have opened up new horizons for 
research on methods for optimising assessment in the service of learning. 
Embarking on these lines of research will undoubtedly contribute 
significantly to the joint efforts to meet the challenge for higher education in 
the knowledge age. 

1. INTRODUCTION 

The role of assessment and evaluation in education has been crucial, 
probably since the earliest approaches to formal education. However, much 
more attention has been paid to this role in the last few decades, largely due 
to wider developments in society. The most fundamental change in our 
views of assessment is represented by the notion of assessment as a tool for 
learning (Dochy & McDowell, 1997). Whereas in the past, we have seen 
assessment primarily as a means to determine measures and thus for 
certification, there is now a belief that the potential benefits of assessing are 
much wider and impinge on all stages of the learning process. The new 
assessment culture (Birenbaum & Dochy, 1996) strongly emphasises the 
integration of instruction and assessment. Students play far more active roles 
in the assessment of their achievement. The construction of tasks, the 
development of criteria for the assessment and the scoring of performance 
may be shared or negotiated among teachers and students. The assessment 
takes all kinds of forms such as observations, text- and curriculum- 
embedded questions and tests, interviews, performance assessments, writing 
samples, exhibitions, portfolio assessment, overall assessment. Several labels 
have been used to describe subsets of these assessment modes, with the most 
common being “direct assessment”, “authentic assessment”, “performance 
assessment” and “alternative assessment”. It is widely accepted that these 
new forms of assessment lead to a number of benefits in terms of the 
learning process: encouraging thinking, increasing learning and also 
increasing students’ confidence (Falchikov, 1986; 1995). 

One could argue that a new assessment culture cannot be evaluated solely 
on the basis of pre-era criteria. To do justice to the basic assumptions of these 
assessment forms, the traditionally used psychometric criteria need to be 
expanded, and additional relevant criteria for evaluating the quality of 
assessment need to be developed (Dierick & Dochy, 2001). In this respect, 
the concept “psychometrics” is often replaced by the concept of 
“edumetrics”. 

In this contribution, we will first focus on the criteria that we see as 
necessary to expand the traditional psychometric criteria to evaluate the 
quality of assessments. In a second part, we will outline some of the 
characteristics of new modes of assessment and relate these to their role 
within the consequential validity. 



2. EVALUATING NEW MODES OF ASSESSMENT 
ACCORDING TO THE NEW EDUMETRIC 
APPROACH 

Various authors have recently proposed ways to extend the criteria, 
techniques and methods used in traditional psychometrics in order to 
evaluate the quality of assessments. Within the literature on quality criteria 
for evaluating assessment, a distinction can be made between authors who 
present a more expanded vision of validity and reliability (Cronbach, 1989; 
Kane, 1992; Messick, 1989) and those who propose specific criteria 
sensitive to the characteristics of new modes of assessment (Frederiksen & 
Collins, 1989; Haertel, 1991; Linn, Baker & Dunbar, 1991). 

If we integrate the most important changes within the assessment field 
with regard to the criteria for evaluating assessment, conducting quality 
assessment inquiry involves a comprehensive strategy that addresses 
evaluating: 

1. The validity of assessment tasks. 

2. The validity of assessment scoring. 

3. The generalizability of assessment. 

4. The consequential validity of assessment. 

During this inquiry, arguments will be found that support or refute the 
construct validity of assessment. Messick (1989) suggested that two 
questions must be asked whenever a decision about the quality of assessment 
is made. 

First, is the assessment any good as a measure of the characteristics it is 
interpreted to assess? Second, should the assessment be used for the 
proposed purpose? 

In evaluating the first question, evidence of the validity of the assessment 
tasks, of the assessment performance scoring and the generalizability of the 
assessment must be considered. The second question evaluates the adequacy 
of the proposed use (intended and unintended effects), against alternative 
means of serving the same purpose. In the evaluative argument, the evidence 
obtained during validity inquiry will be considered and carefully weighed, 
to reach a conclusion about the adequacy of assessment use for the specific 
purpose. 

In Table 1, an overview is given of questions that can be used as 
guidelines for collecting supporting evidence for, and examining possible 
threats to, construct validity. 

Arguments supporting the construct validity of new assessment 
forms are outlined briefly below. 

2.1 Validity of Assessment Tasks Used 

Assessment development begins with establishing an explicit conceptual 
framework that describes the construct domain being assessed: content and 
cognitive specifications should be identified. During the first stage, validity 
inquiry judges how well assessment matches the content and cognitive 
specifications of the construct measured (Shavelson, 2002). The defined 
framework can be used as a guideline to select assessment tasks. 

The following aspects are important to consider. First, the tasks used 
must be an appropriate reflection of the construct or, as one should perhaps 
say within the assessment culture, the competence that needs to be assessed. 
Second, with regard to the content, the tasks should be 
authentic in that they are representative of the real life problems that occur in 
the knowledge domain assessed. Third, the cognitive level needs to be 
complex, so that the same thinking processes are required as those experts use 
for solving domain-specific problems. 

New assessment forms score better on these criteria than so-called 
objective tests, precisely because of their authentic and complex problem 
character (Dierick & Dochy, 2001). 

Table 1. A Framework for Collecting Supporting Evidence for and Examining Possible 

2.2 Validity of Assessment Scoring 

The next aspect that needs to be investigated is whether the scores 
are valid. In this respect, the criterion of fairness plays an important role. This 
requires, on the one hand, that the assessment criteria fit and are used 
appropriately: they are an adequate reflection of the criteria used by experts 
and of the weight given to them in assessing competence. On the other hand, it 
requires that students have a fair chance to demonstrate their real 
ability. 

Potential problems are the following. First, relevant assessment criteria can be 
lacking. As a consequence, certain aspects of the competence at stake do not 
get the appropriate attention. Second, irrelevant assessment criteria (mostly 
reflecting personal preferences) can be used. Characteristic of the 
assessment culture is that competencies of students are assessed at different 
moments, in different ways, with different modes of assessment and by 
different assessors. In this case, potential bias in judgement will be 
counterbalanced by the various interventions. As a result, the totality of 
assessment will give a more complete picture of the real competence of a 
person than is the case with a single objective test, where the decision about 
competence is mostly reduced to a single judgement at a single moment. 
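
The counterbalancing argument can be made concrete with a small numerical sketch. All values below are hypothetical: the "true" competence level and each assessor's bias are assumed for illustration only, and the benefit shown depends on the assumption that the individual biases differ in direction and therefore partly cancel.

```python
from statistics import mean

# Assumed "true" competence of one student on an illustrative rubric scale.
TRUE_LEVEL = 3.0

# Hypothetical systematic bias of each assessor (positive = too lenient).
assessor_bias = {"teacher": 0.6, "peer": -0.4, "self": 0.2, "external": -0.3}

# Each judgement is modelled as the true level plus that assessor's bias.
judgements = {name: TRUE_LEVEL + bias for name, bias in assessor_bias.items()}

# Error of relying on a single judgement versus the mean of all judgements.
single_error = abs(judgements["teacher"] - TRUE_LEVEL)
combined_error = abs(mean(judgements.values()) - TRUE_LEVEL)

print(f"single assessor error: {single_error:.3f}")
print(f"combined error:        {combined_error:.3f}")
```

If all assessors shared the same bias, averaging would not remove it; the sketch only illustrates the case the text describes, in which varied assessors, moments and modes pull in different directions.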

2.3 Generalizability of Assessment 

This step in the validation process investigates the extent to which 
assessment can be generalised to other tasks that measure the same 
construct. Such generalizability indicates that score interpretation is reliable 
and supplies evidence that assessment really measures the purported construct. 

Problems that can occur are construct-underrepresentation and construct- 
irrelevant variance. 

Construct underrepresentation means that the assessment is too narrow, 
so that important construct dimensions cannot be measured. In the case 
of construct-irrelevant variance, the assessment is probably too broad and 
contains systematic variance that is irrelevant for measuring the construct 
(Dochy & Moerkerke, 1997). Consequently, one can discuss how broadly the 
construct or the purported competence needs to be defined before a given 
interpretation is reliable and valid. 

Messick (1989) argues that the validated interpretation gives meaning to the 
measure in the particular instance, and evidence on the generality of 
interpretation over time and across groups and settings shows how stable and 
thus reliable that meaning is likely to be. Other authors go much further than 
Messick. Frederiksen and Collins (1989), for example, have moved 
away from the idea that assessment can only be called reliable if the 
interpretation can be generalised to a broader domain. They use another 
model where the fairness of the scoring is crucial for reliability, but where 
the replicability and generalizability of the performance are not. 

In any case, it can be argued that an assessment using a number of 
authentic, representative tasks, to measure a specific competence, is less 
sensitive to the above-mentioned problems. After all, the purported construct 
is directly measured. Authentic means that tasks are realistic, real life tasks. 
Solving the problems presented in the assessment tasks requires from students 
the often complex cognitive activities that experts display. 

2.4 Consequences of Assessment 

Research into student learning in higher education over a number of 
years has provided considerable evidence to suggest that student behaviour 
and student learning are very much influenced by assessment (Ramsden, 
1992; Marton, Hounsell & Entwistle, 1996; Entwistle, 1988; Biggs, 
1998; Prosser & Trigwell, 1999; Scouller, 1998; Thomas & Bain, 1984). 
This influence of assessment can occur on different levels and depends on 
the function of the assessment (summative versus formative). Consequential 
validity, as this aspect of validity is called, addresses this issue. It implies 
investigation of whether the actual consequences of assessment are also the 
expected consequences. This can be made clear and brought to 
the surface by presenting statements of expected (and unexpected) consequences 
of assessment to the student population, by holding semi-structured key 
group interviews, by recording student time logging (logging the time 
dedicated to assessment) or by administering self-review checklists (Gibbs, 
2002). Using such methods, unexpected effects may also arise. 

The influence of formative assessment (integrated within the learning 
process) is mainly due to the activity of looking back after the completion of 
the assessment task (referred to as “post-assessment-effects”). Feedback is 
the most important cause for these post-assessment-effects. Teachers give 
students information about the quality of their performance and support 
students in reflecting on the learning outcomes and the learning processes 
they are based on. When students have the necessary metacognitive 
knowledge and skills, teacher feedback can be reduced. Students may 
become capable enough to draw conclusions themselves about the quality of 
their learning behaviour (self-generating feedback or internal feedback), 
after, or even during the completion of the assessment task. 

The influence of summative assessment is less obvious, but significant. 
Mostly, post-assessment-effects of summative assessment are small. The 
influence of summative assessment on learning behaviour is mainly pro- 
active, since students tend to adjust their learning behaviour to what they 
expect to be assessed. These effects can be described as pre-assessment 
effects, since such effects occur before assessment takes place. 

An important difference between the pre- and post-assessment effects is 
that the latter are intentional, whereas the former are rather a kind of side 
effect, since the main purpose of summative assessment is not primarily 
to support and sustain learning (but rather the selection and 
certification of students). However, both are important effects, which need 
attention from teachers and instructional designers as part of the evaluation 
of the consequential validity of an assessment. 

Nevo (1995) and Struyf, Vandenberghe and Lens (2001) point to a third 
kind of learning effect from assessment. Students also learn during 
assessment itself, because they often need to reorganise their acquired 
knowledge, use it in different ways to tackle new problems and think 
about relations between ideas that they had not yet discovered while studying. 
When assessment stimulates them towards thinking processes of a higher 
cognitive level, it is possible, the authors mention, that assessment itself 
becomes a rich learning experience for students. This of course applies to 
formative, as well as to summative assessment. We can call this the true 
(pure) assessment-effect. However, the true assessment effect is of a 
somewhat different kind from the two other effects, in that it can provide for 
learning but does not necessarily have a direct effect on learning behaviour, 
except in the form of self-feedback as discussed earlier. 



3. CHARACTERISTICS OF NEW ASSESSMENT 
FORMS AND THEIR ROLE IN THE 
CONSEQUENTIAL VALIDITY 



3.1 Consequential Validity and Constructivism 

Current perspectives on learning are largely influenced by 
constructivism. The assessment approach that is aligned with constructivist- 
based learning environments is sometimes referred to as assessment culture 
(Wolf, Bixby, Glenn, & Gardner, 1991; Kleinsasser, Horsch, & Tastad, 1993; 
Birenbaum & Dochy, 1996). A central aspect of this assessment culture is the 
perception of assessment as a tool for learning. Assessment is supposed to 
support students in active construction of knowledge in context-rich 
environments, in using knowledge to analyse and solve authentic problems, 
and in reflection. Learning, so defined, is facilitated when students are 
participating in the process of learning and assessment as self-regulated, 
self-responsible learners. Finally, learning is conceived as influenced by 
motivation, affect and cognitive styles. 

The interest in the consequential validity of assessment is in alignment 
with the view of assessment as a tool for learning. Evaluating the 
consequences of assessment is largely influenced by the characteristics of 
the assessment culture as part of the constructivist-based learning and 
teaching approach. This means that the way the consequences of assessment 
are defined is determined by the conceived characteristics of learning. 

In the context of a constructivist-based learning environment, this leads 
to questions for evaluating the consequences of assessment such as: what do 
students understand as requirements for the assessment; how do students 
prepare themselves for learning and for the assessment; what kind of 
learning strategy is used by students; is the assessment related to authentic 
contexts; does the assessment stimulate students to apply their knowledge in 
realistic situations; does the assessment stimulate the development of various 
skills; are long-term effects perceived; is effort, instead of mere chance, 
actively rewarded; are breadth and depth in learning rewarded; is independence 
stimulated by making expectations and criteria explicit; is relevant feedback 
provided for progress; are competencies measured, rather than the 
memorisation of facts? 

In the next section, the unique characteristics of new modes of 
assessment will be related to their effects on students’ learning. 

3.2 Consequential Validity and New Modes of 
Assessment 

New modes of assessment have a positive influence on the learning of 
students, on the one hand by stimulating the desired cognitive skills and on 
the other hand by creating an environment that has a positive influence on 
the motivation of students. In the following part, for each of these 
characteristics, we will try to unravel how they interact with personality 
features in order to achieve a deep and self-regulated learning behaviour. 

A first characteristic of assessment is the kind of tasks that are used. New 
modes of assessment focus in the first place on assessing students’ 
competencies, such as their ability to use their knowledge in a creative way 
to solve problems (Dochy, 1999). Tasks that are appropriate within new 
modes of assessment can be described as cognitively complex tasks in 
comparison with traditional test items. Furthermore, assessment tasks are 
characterised as being real problems or authentic representations of problems 
in reality, whereby different solutions can be correct. 

It is assumed that the different assessment demands in new modes of 
assessment will have a different influence on the cognitive strategies used by 
students. This influence of assessment on learning, called “pre-assessment- 
effects” earlier, is indicated by different authors with different terms 
(“systematic validity”, “the backwash effect” and “the forward function”). 

The cognitive complexity of assessment tasks can have a positive 
influence on the students’ cognitive strategies, via the pre-assessment-effect 
(assuming students hold the appropriate perceptions), or the post- 
assessment-effects (assuming there is proper feedback). There are 
indications that students will apply deeper learning strategies to prepare for a 
complex case-based exam than for a reproductive multiple-choice exam. 
Students will, for example, look up more additional information, question 
the content more critically and structure it more personally (McDowell, 
1995; Ramsden, 1988, 1992; Sambell, McDowell, & Brown, 1997; Thomas 
& Bain, 1984; Trigwell & Prosser, 1991; Scouller & Prosser, 1994). The 
effects of the assessment demands on students’ learning are mediated by the 
students’ perceptions of these assessment demands. Research shows that 
students differ in their capability to clearly identify the nature and 
substance of assessment demands. Some are very adept and 
energetic in figuring out optimum strategies for obtaining high marks 
economically (students with an achieving approach), while others are less 
active but take careful note of any cues that come their way, and a minority 
are cue deaf (Miller & Parlett, 1974). Entwistle (2000a, b) uses the terms 
“strategic” and “apathetic” approach to indicate this difference in the 
capacity to identify assessment demands. He posits that sensitivity to 
the context is required to make the best use of the opportunities for learning 
and to interpret the often implicit demands of assessment tasks. 
Streamlining the correct perception of assessment demands and the appropriate 
learning behaviour appears to be critical for the learning result. Nevertheless, 
assessment can also fulfil a supportive role here. The transparency of 
assessment, which is especially directed at clarifying the assessment 
expectations for students, is one of the basic features of new assessment 
forms and will be discussed further below. 

However, even when students correctly identify the assessment demands, 
they may not always be capable of adapting to them. Several studies (Martin & 
Ramsden, 1987; Marton & Saljo, 1976; Ramsden, 1984; Van Rossum & 
Schenk, 1984) have shown that students who generally use surface 
approaches have great difficulty adapting to assessment requirements that 
favour deep approaches. 

Another feature of assessment tasks, namely their authentic character, 
contributes especially to motivating students, because students 
experience the task as more interesting and meaningful when they realise 
its relevance and usefulness (task-value of assessment) (Dochy & 
Moerkerke, 1997). Moreover, the use of authentic assessment tasks also 
creates the context in which to learn, practise and evaluate real and 
transferable problem-solving skills, since these skills require a delicate 
interaction of different partial skills. Aiming for this interaction of partial 
skills by means of “well-structured”, predictable and artificial exercises is 
ineffective. The authenticity of the tasks can thus be considered an 
imperative condition for achieving the expert level of problem solving. 

A second characteristic of new assessment forms is their formative 
function. The term “formative assessment” is interpreted here as 
encompassing all those activities explicitly undertaken by teachers and/or 
students to provide feedback to students so that they can modify the learning 
behaviour in which they are engaged (see Black & Wiliam, 1998). Students can 
also obtain feedback during instruction by actively looking for the demands of 
the assessment tasks. This effect is called “backwash feedback” from 
assessment (Biggs, 1998). This kind of feedback is not what we mean here 
by the term “formative assessment”. The term “formative assessment” is 
only used for assessment that is directed at giving information to students 
while and after they complete an assignment, and that is explicitly directed at 
supporting, guiding and monitoring their learning process. 

The integration of assessment and learning ensures that students are 
encouraged to study in a more profound way during the course, at a moment 
when there is no time pressure, instead of “quickly learning by heart” 
(Askham, 1997; Dochy & Moerkerke, 1997; Sambell et al., 1997; Thomson 
& Falchikov, 1998). It has the advantage that students, via external and 
internal regulation, can get confirmation or corrective input concerning their 
deep learning approach (Dochy & Moerkerke, 1997). External regulation refers 
to the assistance of the teacher, who gives students explicit feedback about 
their learning process and results. Internal regulation of the learning 
process is stimulated when students, based on the received feedback, reflect 
on the level of competency reached and on how they can improve their learning 
behaviour (Askham, 1997). 

Moreover, feedback can also have a positive influence on the intrinsic 
motivation of students. The key factor in obtaining these positive effects of 
feedback seems to be whether students perceive the primary goal of the 
assessment to be controlling their behaviour or providing informative and 
helpful feedback on their progress in learning (Deci, 1975; Keller, 1983; 
Ryan, Connell, & Deci, 1985). Askham (1997) points out that it is an 
oversimplification to think that formative assessment always leads to deep 
level learning and summative assessment to superficial learning. Like other 
authors, he argues that, in order for feedback from assessment to lead to a 
deep learning approach, assessment needs to be embedded in a constructivist 
or powerful learning environment. Birenbaum and Dochy (1996) argue that 
powerful learning environments are characterised by a good balance between 
discovery learning and personal exploration on the one hand, and systematic 
instruction and guidance on the other, always taking into account the 
individual differences in abilities, needs, and motivation among students. By 
giving descriptive feedback, not just a grade, and organising different types 
of follow-up activities, a teacher creates a powerful learning environment. 

A final crucial aspect of the positive influence of feedback is the way it is 
presented to students. Crooks (1988) identifies the following conditions for 
feedback to be effective. “First of all, feedback is most effective if it 
focuses students’ attention on their progress in mastering educational 
tasks” (pp. 468-469). Therefore, it is necessary that an absolute or self- 
referenced norm is used, so that students can compare actual and reference 
levels of performance and use the feedback information to alter the gap. As 
has been indicated, this is also an essential condition for offering students 
who hold a normative concept of ability the possibility of realising 
constructive learning behaviour, since this context does not generate 
competitive feelings among them (which make them use defensive learning 
strategies). “Secondly, feedback should be given while it is still clearly 
relevant. This usually implies that it should be provided soon after a task is 
completed and that the student should then be given opportunities to 
demonstrate learning from feedback. Thirdly, feedback should be specific and 
related to need”. 

In short, formative assessment will have a positive influence on the 
intrinsic motivation of students and accelerate and sustain the required (or 
desired) constructive learning processes when it is embedded in a powerful 
learning environment and takes into account some crucial conditions for 
feedback to be effective. 

A third important characteristic of assessment is the transparency of the 
assessment process and student involvement in the assessment process. 
Different authors point out that the transparency of the assessment criteria 
has a positive influence on the students’ learning process. Indeed, “meeting 
criteria improves learning”: if students know exactly which criteria will be 
used when assessing a performance, their performance will improve because 
they know which goals have to be attained (Dochy, 1999). As has been 
previously indicated, making the assessment expectations transparent 
to students also supports the correct interpretation of 
assessment demands, which appears to be critical for the learning result 
(Entwistle, 2000a, b). 

An effective way to make assessment transparent to students is to involve 
or engage them in the process of formulating criteria. As a consequence, 
students gain better insight into the criteria and procedures of assessment. 
When, in addition, students are actually involved in the assessment process, 
and can thus experience in practice (guided by an “expert evaluator”) what it 
means to evaluate and judge a performance against the criteria, this forms 
an additional support for the development of their self-assessment and self- 
regulation skills (Sadler, 1998; Gipps, 1994). Such an exercise in evaluating 
also contributes to the more effective use of feedback, which leads to more 
reflection. Elshout-Mohr (1994) argues that students are often unwilling 
to give up misunderstandings; they need to be convinced through discussion, 
which promotes their own reflection on their thinking. If a student cannot 
plan and carry out systematic remedial learning work for himself or herself, 
he or she will not be able to make use of good formative feedback. This 
indicates that practising evaluation via peer- or self-assessment is a 
necessary condition for arriving at reflection and self-regulating learning 
behaviour. Furthermore, Corno and Rohrkemper (1985) indicate that 
self-regulated experiences, such as self-assessment, are closely linked with 
intrinsic motivation, presenting evidence that self-regulated learning 
experiences foster intrinsic motivation, and intrinsic motivation in turn 
encourages students to be more independent as learners. 

Peer assessment, too, when used in a formative manner whereby the 
mutual evaluation functions as a support for each other’s learning process, 
can have a positive influence on the intrinsic motivation of students. 

Transparency of the assessment process, by making the judgement 
criteria transparent or, further, via student involvement, can eventually 
lead to qualitatively better learning behaviour. McDowell (1995) states that: 
“Assessment methods which emphasise recalling facts or the repetition of 
procedures are likely to lead many students to adopt a surface approach. 
But also creating fear, or the lack of feedback about the progress in learning, 
or conflicting messages about what will be rewarded, within the assessment 
system, are factors that bring about the use of a surface approach. On the 
other hand, clearly stated academic expectations and feedback to students are 
more likely to encourage students to adopt a deep approach to learning”. 

The last characteristic of new forms of assessment is the norm that is 
applied. In classical testing, relative standard setting has been widely used, 
whereby the achievement of a student is interpreted in relation to that of 
his/her fellow students. This is considered an unfair approach within the new 
assessment paradigm (Gipps, 1993). Students cannot verify the 
achievements of other students, and cannot determine their own score. 
Therefore, students do not have sufficient insight into their absolute 
achievement. Assessment should give the student information about the level 
at which the measured competence is achieved, independent of the achievement 
of others. Within the new assessment paradigm, there is a tendency towards an 
absolute and/or self-referenced norm. The absolute norm is used both for 
formative and summative purposes; the self-referenced norm is most appropriate 
for formative purposes. In this respect, there is a growing tendency to use 
standard-setting methods where students’ performances are compared with 
levels of proficiency of the skills measured and as defined by experts (see 
chapter 10). 

The use of this kind of norm can give rise to greater trust among students in 
the result of the assessment, at least when the transparency condition is 
fulfilled. The lack of a comparative norm in judgement also ensures that 
there is less social competition among students, and thus that there are more 
opportunities for collaborative learning (Gipps, 1994). The emphasis on 
informing students about their progress in mastery, rather than on social 
comparison, is especially crucial for the less able students, who might 
otherwise receive little positive feedback. Schunk (1984) also remarks that 
learning and evaluation arrangements should be sufficiently flexible to 
ensure suitably challenging tasks for the most capable students, as otherwise 
they would have little opportunity to build their perception of self-efficacy. 
When a self-referenced norm is applied, learning and assessment tasks can 
be used in a flexible way. Allowing a degree of student autonomy in the choice 
of learning activities is a key factor in fostering intrinsic motivation, which, 
as has been discussed above, leads to deeper and more self-regulated learning. 



4. CONCLUSION 

In the so-called “new assessment era”, there is a drive towards using 
assessment as a tool for learning. The emphasis is placed on gradually 
integrating instruction and assessment and on involving students as active 
partners in the assessment process. New developments within this 
framework, such as the development and use of new modes of assessment, 
have led to a reconsideration of quality criteria for assessment. As we argued 
earlier (Dierick & Dochy, 2001), the traditional criteria for evaluating the 
quality of tests need to be expanded in the light of the characteristics of the 
assessment culture. 

In this contribution, we have outlined four criteria that seem necessary 
for evaluating the quality of new modes of assessment: the validity of 
assessment tasks, the validity of the assessment scoring, the generalizability 
of the assessment, and the consequential validity. We outlined the questions 
that should be posed in order to test these four criteria in more detail in our 
“framework for collecting evidence for and examining possible threats to 
construct validity”. 

Finally, we elaborated on the issue of “consequential validity”. Until 
now, psychometrics has been interested in the consequences of testing only 
to a very small extent. The most important issue was the reliability of the 
measurement. Nowadays, within the edumetric approach, the consequences 
of the assessment do play an important role in controlling the quality of the 
assessment. Increasingly, research indicates that pre-, post- and true 
assessment effects influence the learning processes of students to a large 
extent. If we want students to learn more and become better learners for their 
lifetime, the consequential validity of the assessments is a precious jewel to 
handle with care. Future research on the quality of new modes of 
assessment, addressing the consequential validity, is needed in order to 
justify the widespread use of new modes of assessment. 

1. INTRODUCTION 

Self-assessment and peer assessment might appear to be relatively new 
forms of assessment, but in fact, they have been deployed in some areas of 
education for many years. For example, George Jardine, professor at the 
University of Glasgow from 1774 to 1826, described a pedagogical plan 
including methods, rules and advantages of peer assessment of writing 
(Gaillet, 1992). By 1999, Hounsell and McCulloch noted that over a quarter 
of assessment initiatives in a survey of higher education (HE) institutions 
involved self and/or peer assessment. Substantial reviews of the research 
literature on self and peer assessment have also appeared (Boud, 1995; Boud 
& Falchikov, 1989; Brown & Dove, 1991; Dochy, Segers, & Sluijsmans, 
1999; Falchikov & Boud, 1989; Falchikov & Goldfinch, 2000; Topping, 
1998). 

Why should teachers, teacher educators and education researchers be 
interested in these developments? Can they enhance quality and/or reduce 
costs? Do they work? Under what conditions? This chapter explores the 
conceptual underpinnings and empirical evidence for the reliability, validity, 
effects, utility, and generalizability of self and peer assessment in schools 
and higher education, and by implication in the workplace and lifelong 
learning. 

All forms of assessment should be fit for their purpose, and the purpose 
of any assessment is a key element in determining its validity and/or 
reliability. The nature and purposes of assessments influence many facets of 
student performance, including anxiety (Wolfe & Smith, 1995), goal 
orientation (Dweck, 1986), and perceived controllability (Rocklin, 
O'Donnell, & Holst, 1995). Of course, different stakeholders might have 
different purposes. 

Many teachers successfully involve students collaboratively in learning 
and thereby relinquish some control of classroom content and management. 
However, some teachers might be anxious about going so far as to include 
self or peer assessments as part of summative assessment, where 
consequences follow from terminal judgements of accomplishments. By 
contrast, formative or heuristic assessment is intended to help students plan 
their own learning, identify their own strengths and weaknesses, target areas 
for remedial action and develop meta-cognitive and other personal and 
professional transferable skills (Boud, 1990, 2000; Brown & Knight, 1994). 
Triangulating formative feedback through the inclusion of self and peer 
assessment might seem to incur fewer potential threats to quality. 

Reviews have confirmed the utility of formative assessment (e.g., 
Crooks, 1988), emphasising the importance of quality as well as quantity of 
feedback. Black & Wiliam (1998) concluded that assessment which 
precisely indicated strengths and weaknesses and provided frequent 
constructive individualised feedback led to significant learning gains, as 
compared to traditional summative assessment. The active engagement of 
learners in the assessment process was seen as critical, and self-assessment 
an essential tool in self-improvement. Affective aspects, such as the 
motivation to respond to feedback and the belief that it made a difference, 
were also important. 

However, the new rhetoric on assessment might not be matched by 
professional practice. For example, MacLellan (2001) found that while 
university staff declared a commitment to the formative purposes of 
assessment and maintained that the full range of learning was frequently 
assessed, they actually engaged in practices which militated against 
formative and authentic assessment being fully realised. 

Explorations of self and peer assessment might be driven by a need to 
improve quality or a need to reduce costs. These two purposes are often 
intertwined, since a professional assessor confronted with twice as many 
products to assess in the same time is likely to allocate less time to each unit 
of assessment, with consequent implications for the reliability and validity of 
the professional assessment. A peer assessor with less skill at assessment but 
more time in which to do it might produce an equally reliable and valid 
assessment. Peer feedback might be available in greater volume and with 
greater immediacy than teacher feedback, which might compensate for any 
quality disadvantage. 

Beyond education, self and peer assessment (or self-improvement 
through self-evaluation prior to peer evaluation) are increasingly found in 
workplace settings (e.g., Bernadin & Beatty, 1984; Farh, Cannella, & 
Bedeian, 1991; Fedor & Bettenhausen, 1989; Joines & Sommerich, 2001), 
sometimes in the guise of "Total Quality Management" or "Best Value" 
exercises (e.g., Kaye & Dyason, 1999). The development of such skills in 
school and HE should thus be transferable. University academics have long 
been accustomed to peer assessment of submissions to journals and 
conferences, the reliability and validity of which have been the subject of 
empirical investigation (and some concern) for many years (e.g., Cicchetti, 
1991). Teachers, doctors and other professionals are often assessed by peers 
in vivo during practice. All of us may expect to be peer assessor and peer 
assessee at different times and in different contexts - or as Cicchetti (1982) 
more colourfully phrased it in a paper on peer review: "we have met the 
enemy and he is us" (p. 205). 

Additionally, peer assessment in particular is connected with other forms 
of peer assisted learning in schools and HE. Recent research has 
considerably clarified the many possible varieties of peer assisted learning, 
their relative effectiveness in a multiplicity of contexts, and the 
organisational parameters crucial for effectiveness (Boud, Cohen, & 
Sampson, 2001; Falchikov, 2001; Topping, 1996a,b; 2001a,b; Topping & 
Ehly, 1998). 

In this chapter, self-assessment is considered first, then peer assessment. 
For each practice, a definition and typology of the practice are offered, 
followed by a brief discussion of its theoretical underpinnings. The 
"accuracy", reliability and validity of the practice in schools and higher 
education is then considered. The research findings of the effects of the 
practice are then reviewed in separate sections focused on schools and higher 
education respectively. The research literature was searched online and 
manually and all relevant items included in the database for this systematic 
review, which consequently should have no particular bias (although space 
constraints do not permit mention of every relevant study by name). A 
summary and conclusions section for each practice relates and synthesises 
the findings. Finally, studies directly comparing self and peer assessment are 
considered, followed by an overall summary and conclusion encompassing 
and comparing and contrasting both practices. Evidence-based guidelines for 
quality implementation of self and peer assessment are then given. 

2. SELF ASSESSMENT 



2.1 Self Assessment - Definition, Typology and Purposes 

Assessment is the determination of the amount, level, value or worth of 
something. Self-assessment is an arrangement for learners and/or workers to 
consider and specify the level, value or quality of their own products or 
performances. 

In self-assessment, the intention is usually to engage learners as active 
participants in their own learning and foster learner reflection on their own 
learning processes, styles and outcomes. Consequently, self-assessment is 
often seen as a continuous longitudinal process, which activates and 
integrates the learner's prior knowledge and reveals developmental pathways 
in learning. In the longer term, it might impact self-management of learning 
- facilitating continuous adaptation, modification and tuning of learning by 
the learner, rather than waiting for others to intervene. There is evidence that 
graduates in employment regard the ability to evaluate one's own work as a 
crucial transferable skill (e.g., Midgley & Petty, 1983). 

There is a large commercial market in the publication of self-test 
materials or self-administered revision quizzes. These are often essentially 
rehearsal for external summative assessment, are not used under controlled 
or supervised conditions, do not appear to have been rigorously evaluated, 
seem likely to promote superficial, mechanistic and instrumental learning, 
and are not our focus here. However, computerised curriculum-based self 
assessment test programmes which give continuous rich formative feedback 
to learners (often termed "Learning Information Systems") have been found 
effective in raising student achievement in schools (e.g., Topping, 1999; 
Topping & Sanders, 2000). 

Self-assessment operates in many different curriculum areas or subjects. 
The products, outputs or performances assessed can vary - writing, 
portfolios, oral and/or audio-visual presentations, test performances, other 
skilled behaviours, or combinations of these. Where skilled behaviours in 
professional practice are self-assessed, this might occur via retrospective 
recollection or by post hoc analysis of video recordings. The self-assessment 
can be summative (judging a final product or performance to be 
correct/incorrect or pass/fail, or assigning some quantitative mark or grade) 
and/or (more usually) formative (involving detailed qualitative assessment of 
better and worse aspects, with implications for making specific onward 
improvements). It may be absolute (referred to external objective benchmark 
criteria) or relative (referring to position in relation to the products or 
performances of the current peer group). Boud (1989) explores the issue of 
whether self-assessment should form part of official student gradings, 
controversial if the practice is assumed to be of uncertain reliability and 
validity, and raising concerns about issues of power and control. 

2.2 Self Assessment - Theoretical Underpinnings 

What does self assessment require from students in terms of cognitive, 
meta-cognitive and social-affective demands? Through what processes might 
these benefit students? Under what conditions might these processes be 
optimised? 

Self-assessment shares some of the characteristics of peer assessment, the 
theoretical underpinnings of which are discussed in detail later. Any form of 
assessment is a cognitively complex undertaking, requiring understanding of 
the goals of the task(s), the criteria for success and the ability to make 
judgements about the relationship of the product or performance to these. 
The process of self-assessment incurs extra time on task and practice. It 
requires intelligent self-questioning - itself cognitively demanding - and is an 
alternative structure for engagement with learning which seems likely to 
promote post hoc reflection. It emphasises learner ownership and 
management of the learning process, and seems likely to heighten the 
learner's sense of personal accountability and responsibility, as well as 
motivation and self-efficacy (Rogers, 1983; Schunk, 1996). All of these 
features are likely to enhance meta-cognition. At first sight self-assessment 
might seem a lonelier activity than peer assessment, but it can lead to 
interaction, such as when discussing assessment criteria or when the learner 
is called upon to justify their self-assessment to a peer or professional tutor. 
Such onward discussions involve constructing new schemata, moderation, 
norm-referencing, negotiation and other social and cognitive demands 
related to the mindful reception and assimilation of feedback. 

2.3 Self Assessment - Reliability and Validity 

This section considers the degree of correspondence between student 
self-assessments and the assessments made of student work by external 
"experts" such as professional teachers. This might be termed "accuracy" of 
self-assessment, if one assumes that expert assessments are themselves 
highly reliable and valid. As this is a doubtful assumption in some contexts 
(see below), it is debatable whether studies of such correspondence should 
be considered to be studies of reliability or validity or both or neither. This 
confusion is reflected in the highly varied vocabulary used in the literature. 







Keith Topping 



There is evidence that the assessment of student products by 
professionals is very variable (Heywood, 1988; Newstead & Dennis, 1994; 
Newstead, 1996; Rowntree, 1977). Inter-rater reliabilities have been shown 
to vary from 0.40 to 0.63 (fourth- and eighth-grade writing portfolios) 
(Koretz, Stecher, Klein, & McCaffrey, 1994), through 0.58 to 0.87 (middle 
and high school writing portfolios) (LeMahieu, Gitomer, & Eresh, 1995) and 
0.68 to 0.73 (elementary school writing portfolios) (Supovitz, MacGowan, & 
Slattery, 1997), to 0.76 to 0.94 (elementary school writing portfolios) 
(Herman, Gearhart, & Baker, 1993), varying with the dimensions assessed 
and grade level. This context should condition expectations for the 
"reliability" and "validity" of assessments by learners, in which the 
developmental process is arguably more important than "accuracy". 
However, Longhurst and Norton (1997) showed that tutor grades for an 
essay correlated quite highly (0.69 - 0.88) with deep processing criteria, 
while the correlation between student and tutor grades was lower (0.43). 
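
As an illustration only, the kind of "correspondence" reported in these studies can be sketched as a correlation between two sets of marks, together with a mean signed difference indicating over- or under-estimation. The function names and the marks below are invented for this sketch and are not drawn from any of the studies cited.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of marks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two marks")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one set of marks is constant")
    return cov / (var_x * var_y) ** 0.5


def mean_bias(self_marks, tutor_marks):
    """Mean signed difference (self minus tutor); positive means over-estimation."""
    return mean(s - t for s, t in zip(self_marks, tutor_marks))


# Invented marks for five students (hypothetical data, illustration only).
self_marks = [65, 70, 58, 80, 62]
tutor_marks = [60, 68, 55, 82, 50]

print("correlation:", round(pearson_r(self_marks, tutor_marks), 2))
print("mean bias:", mean_bias(self_marks, tutor_marks))
```

A high correlation with a large positive bias would indicate that learners rank their work much as tutors do while still over-estimating it, which is why the two figures are worth reporting separately.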

Among schoolchildren, Barnett and Hixon (1997) found age and subject 
differences in the reliability of self-assessment. Fourth 
graders made relatively accurate predictions in each of three subject areas. 
Second graders were similar except for poor predictions in mathematics. 
Sixth graders made good predictions in mathematics and social studies, but 
not in spelling. Blatchford (1997) found race and gender differences in the 
reliability of self assessment in school pupils aged 7-16 years. White pupils 
were less positive about their own attainments and about themselves at 
school. While black girls showed confidence in their attainments, and had 
the highest attainments in reading and the study of English, white girls 
tended to underestimate themselves and have little confidence. 

In higher education, Falchikov and Boud (1989) reported a meta-analysis 
of self-assessment studies which compared teacher and student marks. The 
degree of correspondence varied widely in different studies, from a low 
correlation coefficient of -0.05 to a high of 0.82, with a mean of 0.39. Some 
studies gave inter-assessor agreement as a percentage, and this varied from 
33% to 99%, with a mean of 64%. Correspondence varied with: design and 
implementation quality of the study (better studies showing higher 
correspondence), level of the course (more advanced learners showing 
higher correspondence), area of study (science subjects showing higher 
correspondence than social science), and nature of product or performance 
(academic products showing higher correspondence than professional 
practice). Self-assessments focusing on effort rather than achievement were 
particularly unreliable. Overall, self-assessed grades tended to be higher than 
staff grades. However, more advanced students tended to under-estimate 
themselves. 

Boud and Falchikov (1989) conducted a critical analysis of the literature 
on student self-assessment in HE published between 1932 and 1988. The 
methodological quality of studies was generally poor, although later studies 
tended to be better. Some studies made no mention of any explicit criteria. 
Where there were criteria, very many different scales were used. Some 
studies included ratings of student effort (of very doubtful reliability). Self- 
assessment sometimes appeared to be construed as the learner's guess at the 
professional staff assessment, rather than a rationally based independent 
estimate. The context for the learning to be assessed was often insufficiently 
described. Reports of replications were rare. 

There was a tendency for earlier studies to report self-assessor over- 
rating and later studies under-rating. Overall, more able students tended to 
under-rate themselves, and weaker students to over-rate themselves by a 
larger amount. In an interesting exception, Gaier (1961) found that high 
and low ability students produced more accurate self-assessments than 
middle ranking students. Boud and Falchikov (1989) found that students in 
the later years of courses and graduates tended to generate self-assessments 
more akin to staff assessments than those of students early in courses. 
However, those longitudinal studies which allowed scrutiny of the impact of 
practice in self-assessment over time showed mixed results, four studies 
showing improvement, three studies no improvement. Studies of any gender 
differences were inconclusive. 

More recently, Zoller and Ben-Chaim (1997) found that students over- 
estimated not only their abilities in the subject at hand, but also their abilities 
in self-assessment, as compared to tutor assessments. A review of self- 
assessment in medical education concluded that despite the accepted 
theoretical value of self-assessment, the reliability of the procedure was poor 
(Ward, Gruppen, & Regehr, 2002). However, several later studies have 
shown that the ability of students to assess themselves improves in the light 
of feedback or with time (Birenbaum & Dochy, 1996; Griffee, 1995). Frye, 
Richards, Bradley and Philp (1992) found individual students had a tendency 
towards over- or under-estimation in prediction of examination performance 
that was relatively consistent, but evolved over time with experience, 
maturity and self-assessment practice towards decreased overestimation and 
increased underestimation. Ross (1998) summarised research on self 
assessment, meta-analysing 60 correlations reported in the second-language 
testing literature. Self-assessments and teacher assessments of recently 
instructed ESL learners' functional English skills revealed differential 
validities for self-assessment and teacher assessment depending on the extent 
of learners' experience with the self-assessed skill. 

2.4 Self Assessment in Higher Education: Effects 

In considering the effects of self-assessment, the question arises of "what 
is a good result?" A finding that learners undertaking self-assessment have 
better outcomes than learners who do not, other things being equal, is clearly 
a "good result". A finding that learners undertaking self-assessment instead 
of professional assessment have outcomes as good as (if not significantly 
better than) learners receiving professional assessment is also arguably a 
"good result". However, a finding that learners undertaking self-assessment 
in addition to professional assessment have outcomes only as good as (and 
not significantly better than) learners receiving only professional assessment 
is not a "good result". 

There are relatively few empirical studies of the effects of self- 
assessment. Davis and Rand (1980) compared the performance of an 
instructor-graded and a self-graded class. Although the self-graded class 
over-estimated, their overall performance was the same as the instructor- 
graded class. This suggests that the formative effects of self-assessment are 
no less than those of instructor grading, with much less effort on the part of 
the instructor. Sobral (1997) evaluated self-assessment of elective self- 
directed learning tasks, finding increased levels of self-efficacy and 
significant relationships to measures of deep approaches to study. Academic 
achievement (Grade Point Average) was significantly higher for 
experimental students than for controls, although not all experimental 
students benefited. 

Marienau (1999) found longitudinal perceptions among adult learners 
that the experience of self-assessment strengthened commitment to 
subsequent competent performance, enhanced higher order skills, and 
fostered self-direction, illustrating that effects might not necessarily be 
immediate. El-Koumy (2001) investigated the effects of self-assessment on 
the knowledge and academic thinking of 94 English as a Foreign Language 
(EFL) students. Students were randomly assigned to experimental and 
control groups. The self-assessment group was required to assess their own 
knowledge and thinking before and after each lecture, during a semester. 
Both groups were pre- and post-tested on knowledge and academic thinking. 
The experimental group scored higher on both, but differences did not reach 
statistical significance. 

2.5 Self Assessment in Schools: Effects 

Similar caveats about "what is a good result?" apply here. Towler and 
Broadfoot (1992) reviewed the use of self-assessment in the primary school. 
They argued that assessment should mainly be the responsibility of the 
learner, and that this principle could be realistically applied in education 
from the early years, while emphasising the need for pupil training and a 
whole school approach to ensure quality and consistency. Self-assessment 
has indeed been successfully undertaken with some rather unlikely 
populations in schools, including students with learning disabilities (e.g., 
Lee, 1999; Lloyd, 1982; Miller, 1988) and pre-school and kindergarten 
children (e.g., Boersma, 1995; Mills, 1994). 

Lloyd (1982) compared the effects of self assessment and self-recording 
as interventions for increasing the on-task behaviour and academic 
productivity of elementary school learning disabled students aged 9-10 
years. For this population, self-recording appeared a more effective 
procedure than self-assessment. Miller (1988) noted that learning 
handicapped students tend to be passive learners. For them, self assessment 
included "sizing up" the task before beginning, gauging their own skill level 
and likelihood of success before beginning, continuous self-monitoring and 
assessment during task performance, and consideration of the quality of the 
final product or performance. Self-assessment effectiveness was seen as 
likely to vary according to three sets of parameters: person variables (such as 
age, sex, developmental skills, self-esteem), task variables (such as 
meaningfulness, task format, level of complexity), and strategy variables 
(specific strategy knowledge, relational strategy knowledge, and meta- 
memory). 

Even with pre-school children, portfolios can be used to develop the 
child's own self-assessment skills and give a focus to discussions between 
the child and salient adults. In Mills' (1994) study, portfolios were organised 
around four areas of development: physical, social and emotional, emergent 
literacy, and logico-mathematical. For each area there was a checklist, and 
evidence was included to back up the checklist. At points during the school 
year, a professional met with each parent to discuss the portfolio, results, and 
goals for the child. The portfolio was subsequently made available to the 
child's kindergarten teacher. 

Boersma (1995) described curricular modifications designed to increase 
students' ability to self-assess and set goals in grades K-5. Problems with 
self-evaluation and goal setting were documented through parent, teacher, 
and student surveys. Interventions included the development of a portfolio 
system of assessment and the implementation of reflective logs and response 
journals. These were successful in improving student self-evaluation and 
goal setting across the grades, but improvement was more marked for the 
older students. 

Rudd and Gunstone (1993) studied the development of self-assessment 
skills in science and technology in third grade children. Self-assessment was 
scaffolded through questionnaires, concept maps and graphs created by 
students. Specific self-assessment concepts and techniques were introduced 
to the students during each term, over one academic year. Student awareness 
and use of skills in these classes were substantially enhanced. The teacher's 
role changed from controller to delegator as students became more proficient 
at self-assessment. 

There is some evidence that engagement in self-assessment has positive 
effects on achievement in schools. Sink, Barnett and Hixon (1991) found 
that planning and self-assessment predicted higher academic achievement in 
middle school students. Fontana and Fernandes (1994) tested the effects of 
the regular use of self-assessment techniques on mathematical performance 
with children in 25 primary school classes. Children (n=354) in these classes 
showed significant improvements in scores on a mathematics test, compared 
to a control group (n=313). In a replication, Fernandes and Fontana (1996) 
found children trained in self-assessment showed significantly less 
dependence upon external sources of control and upon luck as explanations 
for school academic events, when compared to a matched control group. In 
addition, the experimental children showed significant improvements in 
mathematics scores relative to the control group. 

Ninness, Ninness, Sherman and Schotta (1998) and Ninness, Ellis and 
Ninness (1999) trained school students in self-assessment by computer- 
interactive tutorials. Students received computer-displayed accuracy 
feedback plus reinforcement for correct self-assessments of their math 
performance. After withdrawal of reinforcement, self-assessment alone was 
found motivational, facilitating high rates and long durations of math 
performance. McDonald (2002) gave experimental high school students 
extensive training in self-assessment and using a post-test only design 
compared their subsequent public examination performance to that of 
controls, finding the self-assessment group superior. 

Additionally, self-assessment in schools is not confined to academic 
progress. Wassef, Mason, Collins, O'Boyle and Ingham (1996) evaluated a 
self-assessment questionnaire for high school students on emotional distress 
and behavioural problems, and found it reliable in relation to staff 
perceptions. 



2.6 Summary and Conclusions on Self Assessment 

Self-assessment is increasingly widely operated in schools and HE, 
including with very young children and those with special educational needs 
or learning disabilities. It is widely assumed to enhance meta-cognition and 
self directed learning, but this is unlikely to be automatic. The solid evidence 
for this is small, although encouraging. It suggests self-assessment can result 
in gains in learner management of learning, self-efficacy, deep rather than 
superficial learning, and performance on traditional summative tests. Effects have been 
found to be at least as good as those from instructor assessment, and often 
better. However, effects might not be immediate and might be cumulative. 

The reliability and validity of instructor assessment is not high, but that 
of self-assessment tends to be a little lower and more variable, with a 
tendency to over-estimation. The reliability and validity of self-assessment 
tends to be higher in relation to the ability of the learner, the amount of 
scaffolding, practice and feedback and the degree of advancement in the 
course, rather than chronological age. Other variables affecting reliability 
and validity include: the nature of the subject area, the nature of the product 
or performance assessed, the nature and clarity of the assessment criteria, the 
nature of assessment instrumentation, and cultural and gender differences. 

In all sectors, much further development is needed, with improved 
implementation and evaluation quality and fuller and more detailed reporting 
of studies. Exploration of the effects of self-assessment is particularly 
needed. 



3. PEER ASSESSMENT 



3.1 Peer Assessment: Definition, Typology & Purposes 

Assessment is the determination of the amount, level, value or worth of 
something. Peer assessment is an arrangement for learners and/or workers to 
consider and specify the level, value or quality of a product or performance 
of other equal-status learners and/or workers. 

Peer assessment activities can vary in a number of ways, operating in 
different curriculum areas or subjects. The product or output to be assessed 
can vary - writing, portfolios, oral presentations, test performance, or other 
skilled behaviours. The peer assessment can be summative or formative. The 
participant constellation can vary - the assessors may be individuals or pairs 
or groups; the assessed may be individuals or pairs or groups. Directionality 
can vary - peer assessment can be one-way, reciprocal or mutual. Assessors 
and assessed may come from the same or different year of study, and be of 
the same or different ability. Place and time can vary - peer assessment can 
be formal and in class, or occur informally out of class. The objectives for 
the exercise may vary - the teacher may target cognitive or meta-cognitive 
gains, time saving, or other goals. 

3.2 Peer Assessment - Theoretical Underpinnings 

What does peer assessment require from students in terms of cognitive, 
meta-cognitive and social-affective demands? Through what processes might 
these benefit students? Under what conditions might these processes be 
optimised? 

3.2.1 Feedback 

The conditions under which feedback in learning is effective are complex 
(Bangert-Drowns, Kulik, Kulik, & Morgan, 1991; Butler & Winne, 1995; 
Kulhavy & Stock, 1989). Feedback can reduce errors and have positive 
effects on learning when it is received thoughtfully and positively. It is also 
essential to the development and execution of self-regulatory skills (Bangert- 
Drowns, et al., 1991; Paris & Newman, 1990; Paris & Paris, 2001). Butler 
and Winne (1995) argue that feedback serves several functions: to confirm 
existing information, add new information, identify errors, correct errors, 
improve conditional application of information, and aid the wider 
restructuring of theoretical schemata. Students react differently to feedback 
from peers and from adults (Cole, 1991; Dweck & Bush, 1976; Henry, 
1979). Gender differences in responsiveness to peer feedback have also been 
found (Dweck & Bush, 1976), but this interacts with age (Henry, 1979). 

3.2.2 Cognitive Demands 

Providing effective feedback or assessment is a cognitively complex task 
requiring understanding of the goals of the task and the criteria for success, 
and the ability to make judgements about the relationship of the product or 
performance to these. Webb (1989) and Webb and Farivar (1994) identified 
conditions for effective helping: relevance to the goals and beliefs of the 
learner, relevance to the particular misunderstandings of the learner, 
appropriate level of elaboration, timeliness, comprehension by the help- 
seeker, opportunity to act on help given, motivation to act, and constructive 
activity. 

Cognitively, peer assessment might create effects by gains in a number of 
variables pertaining to cognitive challenge and development, for assessors, 
assessees, or both (Topping & Ehly, 1998, 2001). These could include levels 
of time on task, engagement, and practice, coupled with a greater sense of 
accountability and responsibility. Formative peer assessment is likely to 
involve intelligent questioning, coupled with increased self-disclosure and 
thereby assessment of understanding. Peer assessment could enable earlier 
error and misconception identification and analysis. This could lead to the 
identification of knowledge gaps, and engineering their closure through 
explaining, simplification, clarification, summarising and cognitive 
restructuring. Feedback (corrective, confirmatory, or suggestive) could be 
more immediate, timely, and individualised. This might increase reflection 
and generalisation to new situations, promoting self-assessment and greater 
meta-cognitive self-awareness. Cognitive and meta-cognitive benefits might 
accrue before, during or after the peer assessment. Falchikov (1995, 2001) 
noted that "sleeper" (delayed) effects are possible. 

3.2.3 Social Demands 

Any group can suffer from negative social processes, such as social 
loafing, free rider effects, diffusion of responsibility, and interaction 
disabilities (Cohen, 1982; Salomon & Globerson, 1989). Social processes 
might influence and contaminate the reliability and validity of peer 
assessments (Byard, 1989; Falchikov, 1995; Pond, Ul-Haq, & Wade, 1995). 
Falchikov (2001) explores questions of role ambiguity, dissonance and 
conflict in relation to authority and status issues and attribution theory. Peer 
assessments might be partly determined by: friendship bonds, enmity or 
other power processes, group popularity levels of individuals, perception of 
criticism as socially uncomfortable or even socially rejecting and inviting 
reciprocation, or collusion leading to lack of differentiation. The social 
influences might be particularly strong with "high stakes" assessment, for 
which peer assessments might drift toward leniency (Farh, et al., 1991). 
Magin (2001a) noted that concerns about peer assessment are often focused 
upon the potential for bias emanating from social considerations - so-called 
"reciprocity effects". However, in his own study he found such effects 
accounted for only 1% of the variance. In any case, all these social factors 
require professional teacher scrutiny and monitoring. However, peer 
assessment demands social and communication skills, negotiation and 
diplomacy (Riley, 1995), and can develop teamwork skills. Learning how to 
give and accept criticism, justify one's own position and reject suggestions 
are all useful transferable social and assertion skills (Marcoulides & Simkin, 
1991). 

3.2.4 Affect 

Both assessors and assessees might experience initial anxiety about the 
process. However, peer assessment involves students directly in learning, 
and might promote a sense of ownership, personal responsibility and 
motivation. Giving positive feedback first might reduce assessee anxiety and 
improve acceptance of negative feedback. Peer assessment might also 
increase variety and interest, activity and inter-activity, identification and 
bonding, self-confidence, and empathy with others - for assessors, assessees, 
or both. 

3.2.5 Systemic Benefits 

Peer assessment offers triangulation and per se seems likely to improve 
the overall reliability and validity of assessment. It can also give students 
greater insight into institutional assessment processes (Fry, 1990), perhaps 
developing greater tolerance of inevitable difficulties of discrimination at the 
margin. It has been contended that peer assessment is not costly in teacher 
time. However, other authors (e.g., Falchikov, 2001) caution that there might 
be no saving of time in the short to medium term, since establishing good 
quality peer assessment requires time for organisation, training and 
monitoring. If the peer assessment is to be supplementary rather than 
substitutional, then no saving is possible, and extra costs or opportunity costs 
will be incurred. However, there might be meta-cognitive benefits for staff 
as well as students. Peer assessment can lead staff to scrutinise and clarify 
assessment objectives and purposes, criteria and marking scales. 

3.3 Peer Assessment - Reliability and Validity 

This section considers the degree of correspondence between student 
peer assessments and the assessments made of student work by external 
"experts" such as professional teachers. Caveats regarding the use of the 
terms "accuracy", "reliability" and "validity" are as for self-assessment. 
Many purported studies of "reliability" might be considered studies of 
"accuracy" or "validity", comparing peer assessments with assessments 
made by professionals, rather than with those of other peers, or the same 
peers over time. 

Additionally, many studies compare marks, scores and grades awarded 
by peers and staff, rather than focusing upon more open-ended formative feedback. 
This raises concerns about the uncertain psychometric properties of such 
scoring scales (such as sensitivity and scalar properties), alignment of the 
mode of assessment with teaching and learning outcomes (i.e. relevance of 
the assessment), and consequently validity in any wider sense. By contrast, 
the reliability and validity of detailed formative feedback was explored by 
Falchikov (1995) and Topping, Smith, Swanson, & Elliot (2000), for 
example. 

Research findings on the reliability and validity of peer assessment 
mostly emanate from studies in HE. In a wide variety of subject areas and 
years of study, the products and performances assessed have included: 
essays (Catterall, 1995; Haaga, 1993; Marcoulides & Simkin, 1991, 1995; 
Orpen, 1982; Pond, et al., 1995), hypermedia creations (Rushton, Ramsey, & 
Rada, 1993), oral presentations (Freeman, 1995; Hughes & Large, 1993a,b; 
Magin & Helmore, 2001), multiple choice test questions (Catterall, 1995), 
practical reports (Hughes, 1995), individual contributions to a group project 
(Mathews, 1994; Mockford, 1994) and professional skills (Korman & 
Stubblefield, 1971; Ramsey, Carline, Blank, & Wenrich, 1996). Methods for 
computerising peer assessment are now appearing (e.g., Davies, 2000). 

Over 70% of the HE studies find "reliability" and "validity" adequate, 
while a minority find these variable (Falchikov & Goldfinch, 2001; Topping, 
1998). MacKenzie (2000) reported satisfactory reliability for peer 
assessment of performance in viva examinations. Magin & Helmore (2001) 
found inter-rater reliability for tutors making parallel assessments of oral 
presentations higher than that for peer assessments, but the reliability for 
tutors was not high (0.40 to 0.53). Magin & Helmore (2001) concluded that 
the reliability of summative assessments of oral presentations could be 
improved by combining teacher marks with the averaged marks obtained 
from multiple peer ratings. A tendency for peer marks to bunch around the 
median is sometimes noted (e.g., Catterall, 1995; Taylor, 1995). Student 
acceptance (or belief in reliability) varies from high (Falchikov, 1995; Fry, 
1990; Haaga, 1993) to low (Rushton, et al., 1993), quite independently of 
actual reliability. 
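
One way to see why averaging multiple peer ratings might help is the Spearman-Brown formula from classical test theory, which estimates the reliability of the mean of several raters from the reliability of a single rater. The sketch below is an illustration under stated assumptions, not the procedure used in the studies cited: it assumes raters are interchangeable ("parallel"), and the weighting in `combined_mark` (a hypothetical helper with an arbitrary default weight of 0.5) is invented for the example.

```python
from statistics import mean


def spearman_brown(single_rater_reliability, n_raters):
    """Estimated reliability of the mean of n parallel raters (Spearman-Brown)."""
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def combined_mark(teacher_mark, peer_marks, teacher_weight=0.5):
    """Weighted blend of a teacher mark and the average of several peer marks.

    teacher_weight is a hypothetical parameter; the source does not say how
    teacher and peer marks should be weighted.
    """
    if not peer_marks:
        raise ValueError("at least one peer mark is required")
    if not 0 <= teacher_weight <= 1:
        raise ValueError("teacher_weight must lie between 0 and 1")
    return teacher_weight * teacher_mark + (1 - teacher_weight) * mean(peer_marks)


# Under the parallel-rater assumption, a single-rater reliability of 0.5
# rises as more ratings are averaged.
for k in (1, 3, 5):
    print(k, "raters:", round(spearman_brown(0.5, k), 2))

print("combined mark:", combined_mark(60, [70, 80]))
```

On these assumptions, even modestly reliable individual peer ratings can yield a reasonably dependable average, provided enough peers rate each piece of work.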

Contradictory findings might be explained in part by differences in 
contexts, the level of the course, the product or performance being evaluated, 
the contingencies associated with those outcomes, clarity of judgement 
criteria, and the training and support provided. Reliability tends to be higher 
in advanced courses; lower for assessment of professional practice than for 
academic products. Discussion, negotiation and joint construction of 
assessment criteria with learners is likely to deepen understanding, give a 
greater sense of ownership, and increase reliability (see the review by 
Falchikov & Goldfinch, 2000 - although Orsmond, Merry and Reiling, 2000, 
found otherwise). Reliability for an aggregated global peer mark might be 
satisfactory, but not for separate detailed components (e.g., Lejk & Wyvill, 
2001; Magin, 2001b; Mockford, 1994). Peer assessments are generally less 
reliable when unsupported by training, checklists, exemplification, teacher 
assistance and monitoring (Lawrence, 1996; Pond, et al., 1995; Stefani, 
1992, 1994). Segers and Dochy (2001) found peer marks correlated well 
with both tutor marks and final examination scores. 

Findings from HE settings might not apply in other contexts. However, a 
number of other studies in the school setting have found encouraging 
consistency between peer and teacher assessments (Karegianes, Pascarella, 
& Pflaum, 1980; Lagana, 1972; MacArthur, Schwartz, & Graham, 1991; 
Pierson, 1967; Weeks & White, 1982). 

3.4 Peer Assessment in Schools: Effects 

Similar caveats about "what is a good result?" apply to peer assessment 
as to self-assessment. In schools, much peer assessment has focused on 
written products or multimedia work portfolios. A review has been provided 
by O'Donnell and Topping (1998). 

3.4.1 Peer Assessment of Writing 

Peer assessment of writing might involve giving general feedback, or 
going beyond that to very specific feedback about possible improvements. 
Peer assessment can focus on the whole written product, or components of 
the writing process, such as planning, drafting or editing. Studies in schools 
have shown less interest in reliability and validity than in HE, and more 
interest in effects on subsequent learner performance. Peer assessment of 
writing is also used with classes studying English as a Second or Additional 
Language (ESL, EAL) and foreign languages (Byrd, 1994; Samway, 1993). 

Bouton and Tutty (1975) reported a study of the effects of peer 
assessment of writing with high school students. The experimental group did 
better than the control group in a number of areas. Karegianes, et al. (1980) 
examined the effects of peer editing on the writing proficiency of 49 low- 
achieving tenth grade students. The peer edit group had significantly higher 
writing proficiency than students whose essays were edited by teachers. 
Weeks and White (1982) compared groups of grade 4 and 6 students in peer 
editing and teacher editing conditions. Differences were not significant, but 
the peer assessment group showed more improvement in mechanics and in 
the overall fluency of writing. 

Raphael (1986) compared peer editing and teacher instruction with fifth 
and sixth grade students and their teachers, finding similar improvements in 
composition ability. Califano (1987) made a similar comparison in two fifth 
and two sixth grade classes, with similar results in writing ability and 
attitudes toward writing. Cover's (1987) study of peer editing with seventh 
graders found a statistically significant improvement in editing skills and 
attitudes toward writing. Wade (1988) combined peer feedback with peer 
tutoring for sixth-grade students. After training, the children could provide 
reliable and correct feedback, and results clearly demonstrated 
improvements in student writing. 

Holley (1990) found peer editing of grammatical errors with grade 12 
high school students in Alabama resulted in a reduction in such errors and 
greater student interest and awareness. MacArthur and his colleagues (1991) 
used peer editing in grades 4-6 in special education classrooms, which 
proved more effective than only regular teacher instruction. Stoddard and 
MacArthur (1993) demonstrated the effectiveness of peer editing with 
seventh and eighth grade students with learning disabilities. The quality of 
writing increased substantially from pre-test to post-test, and the gains were 
maintained at follow-up and generalised to other written work. 

3.4.2 Peer Response Groups 

Peer response groups are a group medium for peer assessment and 
feedback, obviously involving different social demands to peer assessment 
between paired individuals. Gere and Abbot (1985) analysed the quality of 
talk in response groups. Students did stay on task and provide content- 
related feedback. Younger students spent more time on content than did 
older students, who attended more to the form of the writing. However, 
when Freedman (1992) analysed response groups in two ninth grade 
classrooms, she concluded that students suppressed negative assessments of 
their peers. 

The effects of revision instruction and peer response groups on the 
writing of 93 sixth grade students were compared by Olson (1986, 1990). 
Students receiving instruction that included both teacher revision and peer 
assessment wrote rough and final drafts which were significantly superior to 
those of students who received teacher revision only, while peer assessment 
only students wrote final drafts significantly superior to revision instruction 
only students. Rijlaarsdam (1987) and Rijlaarsdam and Schoonen (1988) 
made similar comparisons with 561 ninth grade students in eight different 
schools. Teacher instruction and peer assessment proved equally effective. 

Weaver (1995) surveyed over 500 instructors about peer response groups 
in writing. Regardless of the stage in the writing process (early vs. late), 
instructors generally found peer responses to be more effective than the 
teacher's. In contrast, students stated they found the teacher's responses to be 
more helpful in all stages of writing. Nevertheless, when students could seek 
peer responses at the Writing Centre but not in class, their writing improved. 

3.4.3 Portfolio Peer Assessment 

A portfolio is "a purposeful collection of student work that exhibits the 
student's efforts, progress, or achievement in one or more areas. The 
collection must include student participation in selecting contents, the 
criteria for judging merit, and evidence of the student's self reflection" 
(Paulson, Paulson, & Meyer, 1991, p. 60). Thus, a student must be able to 
judge the quality of his or her own work and develop criteria for what should 
be included in order to develop an effective portfolio. However, there is as 
yet little adequate empirical literature on the effects of peer assessment of 
portfolios in schools. 

3.4.4 Other Kinds of Peer Assessment in Schools 

McCurdy and Shapiro (1992) deployed peers to undertake curriculum- 
based measurement in reading among 48 elementary students with learning 
disabilities, comparing with teacher and self-assessment. It was found that 
students in the self and peer conditions could collect reliable data on the 
number of correct words per minute. No significant differences were found 
between conditions. Salend, Whittaker and Reeder (1993) examined the 
efficacy of a consensus based group evaluation system with students with 
disabilities. The system involved: (a) dividing the groups into teams; (b) 
having each team agree on a common rating for the group's behaviour during 
a specified time period; (c) comparing each team's rating to the teacher's 
rating; and (d) delivering reinforcement to each team based on the group's 
behaviour and the team's accuracy in rating the group's behaviour. Results 
indicated that the system was an effective strategy for modifying behaviour. 
Ross (1995) had grade 7 students assess audio tape recordings of their own 
math co-operative learning groups at work. This resulted in increases in the 
frequency and quality of help seeking and help giving, and in improved 
student attitudes about asking for help. 
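
Steps (c) and (d) of the consensus based group evaluation system described by Salend, Whittaker and Reeder (1993) can be sketched in code. The following Python fragment is only an illustration under stated assumptions: the 1-5 rating scale, the behaviour threshold (min_behaviour), the accuracy tolerance and all names are hypothetical and are not taken from the original study.

```python
from dataclasses import dataclass


@dataclass
class TeamOutcome:
    team: str
    rating_gap: int    # absolute difference between team and teacher rating
    accurate: bool     # team rating lies within the tolerance
    reinforced: bool   # reinforcement is delivered for this period


def evaluate_period(team_ratings, teacher_rating, min_behaviour=3, tolerance=1):
    """Compare each team's consensus rating with the teacher's rating.

    Assumed rules (hypothetical): ratings use a 1-5 scale, group behaviour
    is acceptable when the teacher's rating is at least `min_behaviour`,
    and a team is accurate when its rating is within `tolerance` points
    of the teacher's rating.
    """
    behaviour_ok = teacher_rating >= min_behaviour
    outcomes = []
    for team, rating in team_ratings.items():
        gap = abs(rating - teacher_rating)
        accurate = gap <= tolerance
        # Reinforcement depends on both group behaviour and rating accuracy.
        outcomes.append(TeamOutcome(team, gap, accurate, behaviour_ok and accurate))
    return outcomes


# Three teams rate the same period; the teacher's rating is 4.
outcomes = evaluate_period({"A": 4, "B": 2, "C": 5}, teacher_rating=4)
for outcome in outcomes:
    print(outcome.team, outcome.rating_gap, outcome.accurate, outcome.reinforced)
```

In this sketch a team that rates accurately still receives no reinforcement when the group's behaviour falls below the threshold, which reflects the two conditions named in step (d).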

3.5 Peer Assessment in Higher Education: Effects 

Similar caveats about "what is a good result?" apply to peer assessment 
in HE as to self-assessment. In this section, studies of quantitative peer 
assessment are considered first, then other studies are grouped according to 
the product or performance assessed. 

3.5.1 Peer Assessment through Tests, Marks or Grades 

Hendrickson, Brady and Algozzine (1987) compared individually 
administered and peer mediated tests, finding scores significantly higher 
under the peer mediated condition. The latter was preferred by students, who 
found it less anxiety-provoking. Ney (1989) applied peer assessment to tests 
and mid-term and final exams. This resulted in improved mastery of the 
subject matter, and better classroom attendance. Stefani (1994) had students 
define the marking schedule for peer assessed experimental laboratory 
reports, and reported learning gains from the overall process. Catterall 
(1995) had multiple choice and short essay tests peer marked by 120 
marketing students. Learning gains from peer assessment were reported by 
88% of participants, and impact on the ability to self-assess was reported by 
76%. Hughes (1995) had first year pharmacology students use a detailed 
model-marking schedule. Their subsequent performance in practical work 
improved in comparison with cohorts from previous years, whose ability on entry was 
identical. Segers and Dochy (2001) found no evidence of any effect of peer 
marking on learning outcomes. 

3.5.2 Peer Assessment of Writing 

In a business communication class, Roberts (1985) compared peer 
assessment in groups of five with staff assessment. Pre- and post-tests 
showed a statistically significant difference in favour of the peer condition. 
Falchikov (1986) involved 48 biological science students in discussion and 
development of essay assessment criteria. They felt the peer assessment 
process was difficult and challenging, but helped develop critical thinking. A 
majority reported increased learning and better self-organisation, while 
noting that it was time-consuming. The effects of teacher feedback, peer 
feedback and self-assessment were compared by Birkeland (1986) with 76 
technicians, but no significant differences were found between conditions on 
test gains in paragraph writing ability. Richer (1992) compared the effects of 
peer group discussion of essays with teacher discussion and feedback. 
Grading of 174 pre- and post-test essays from 87 first year students indicated 
greater gains in writing proficiency in the peer feedback group. Hughes 
(1995) compared teacher, peer and self-assessment of written recording of 
pharmacology practical work, finding them equally effective. 

Graner (1985) compared the effect of peer assessment and feedback in 
small groups to that of assessment of another's work alone using an editorial 
checklist. Both groups then rewrote their essays, and final grading was by 
staff. Both groups significantly improved from initial to final draft, and no 
significant difference was found between the groups. This suggests that 
practising critical evaluation can have generalised effects on the evaluator's 
own work, even in the absence of any external feedback about their own 
work. Chaudron (1983) compared the effectiveness of teacher feedback with 
feedback from peers with either English or another language as their first 
language. Students in all conditions showed a similar pattern of 
improvement. Working with 81 college students of ESL in Thailand and 
Hawaii, Jacobs and Zhang (1989) compared teacher, peer and self- 
assessment of essays. The type of assessment did not affect informational or 
rhetorical accuracy, but teacher and peer feedback was found to be more 
effective for grammatical accuracy. 

3.5.3 Peer Assessment of Oral & Presentation Skills 

Heun (1969) compared the effect on student self-concept of peer and 
staff assessment of four public speeches given by students. Compared to a 
control group, peer influence on the self-concept of students reached a 
significant level for the final speech, while instructor influence was non- 
significant across all four speeches. Mitchell and Bakewell (1995) found 
peer review of oral presentation skills led to significantly improved 
performance. Williams (1995) used peer assessment of oral presentations of 
critical incident analysis in undergraduate clinical practice nursing. 
Participants felt learning was enhanced, and the experience relevant to peer 
appraisal skills in the future working setting. 

3.5.4 Peer Assessment of Group Work & Projects 

Peer assessment has been used to help with the differentiation of 
individual contributions to small group projects (Conway, Kember, Sivan, & 
Wu, 1993; Falchikov, 1993; Goldfinch, 1994; Mathews, 1994), but empirical 
research on effects is hard to find. In a study of psychology students 
(Falchikov, 1993), group members and the lecturer negotiated self and peer 
assessment checklists of group process behaviours. Task-oriented behaviours 
proved easier to rate reliably than pro-social group maintenance behaviours 
such as facilitating the inclusion of quieter group members. Abson (1994) 
had marketing research students working in self-selected tutor-less groups 
use a simple five point rating scale on four criteria (co-operation, ideas, 
effort, reliability). A case study of one group suggested peer assessment 
might have made students work harder. Strachan and Wilcox (1996) used 
peer and self-assessment of group work to cope with increased enrolment in 
a third-year course in microclimatology. Students found this fair, valuable, 
enjoyable, and helpful in developing transferable skills in research, 
collaboration and communication. 
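
One possible way of using peer ratings to differentiate individual contributions is to scale a shared group mark by each member's relative peer rating. The sketch below borrows the four criteria and the five point scale reported for Abson (1994). The weighting rule itself (a member's mean rating total divided by the group mean), the cap on marks and the sample data are hypothetical, and are not the procedure of any study cited here.

```python
CRITERIA = ("co-operation", "ideas", "effort", "reliability")


def contribution_weights(ratings):
    """Return a relative weight per member from peer rating sheets.

    `ratings` maps each member to the list of sheets received from peers;
    each sheet maps a criterion to a score on an assumed 1-5 scale.
    """
    totals = {}
    for member, sheets in ratings.items():
        sheet_totals = [sum(sheet[c] for c in CRITERIA) for sheet in sheets]
        totals[member] = sum(sheet_totals) / len(sheet_totals)
    group_mean = sum(totals.values()) / len(totals)
    # A weight of 1.0 means the member was rated at the group average.
    return {member: total / group_mean for member, total in totals.items()}


def individual_marks(group_mark, weights, cap=100.0):
    """Scale the group mark by each weight, capped at an assumed maximum."""
    return {m: min(cap, round(group_mark * w, 1)) for m, w in weights.items()}


def sheet(*scores):
    """Build one rating sheet from scores given in CRITERIA order."""
    return dict(zip(CRITERIA, scores))


# Hypothetical ratings: each member is rated by the two other members.
ratings = {
    "Ann": [sheet(4, 4, 5, 5), sheet(5, 4, 4, 5)],
    "Ben": [sheet(3, 3, 3, 3), sheet(3, 2, 3, 4)],
    "Cat": [sheet(4, 4, 4, 3), sheet(4, 3, 4, 4)],
}
weights = contribution_weights(ratings)
marks = individual_marks(60, weights)
print(marks)
```

Because the weights are relative to the group mean, the scheme redistributes the group mark among members rather than raising or lowering the overall level, which is a design choice and not a finding of the literature reviewed.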

3.5.5 Peer Assessment of Professional Skills 

Peer assessment of professional skills can take place within the institution 
and/or out on practical placements or internships. In the latter case it is an 
interesting parallel to "peer appraisal" between staff in the workplace. It has 
been used by medical schools (e.g., Arnold, Willoughby, Calkins, Gammon, 
& Eberhart, 1981; Burnett & Cavaye, 1980; McAuley & Henderson, 1984), 
in pre-service teacher training (e.g., Litwack, 1974; Reich, 1975), and for 
other professions. It has also been used in short practical laboratory sessions 
(e.g., Stefani, 1992). Application is also reported in more exotic areas, such 
as applied brass jury performances (Bergee, 1993), and a range of other 
musical performance arts (Hunter & Russ, 1995). Lennon (1995) considered 
tutor, peer and self-assessments of the performance of second year 
physiotherapy students in practical simulations. Students rated the learning 
experience highly overall. Also in physiotherapy, Orr (1995) used peer 
assessment in role-play simulation triads. Participants rated the exercise 
positively, but felt some anxiety about it. Ramsey and colleagues (1996) 
studied peer assessment of professional performance for 187 medical interns. 
The process was acceptable to the subjects, and reliability adequate despite 
the use of self-chosen raters. 

Franklin (1981) compared self, peer and expert observational assessment 
of teaching sessions with pre-service secondary science teachers. There were 
no differences between the groups in skill acquisition. A similar study by 
Turner (1981) yielded similar results. Yates (1982) used reciprocal paired 
peer feedback with fourteen special education student teachers, followed by 
self-monitoring. The focus was the acquisition and maintenance of the skill 
of giving specific praise to learning-disabled pupils. Peer feedback was 
effective in increasing student teachers' use of motivational praise, but not 
content-based praise. With self-monitoring, rates of both kinds of praise were 
maintained. Lasater (1994) paired twelve student teachers to give feedback 
to each other during twelve lessons in a 5-week practicum placement. The 
participants reported the personal benefits to be improved self-confidence 
and reduced stress. The benefits to their teaching included creative 
brainstorming and fine-tuning of lessons, resulting in improved organisation, 
preparation, and delivery of lessons. 

3.5.6 Computer-Assisted Peer Assessment 

Wider availability of word processing and electronic mail has created 
opportunities for formative peer assessment in electronic draft prior to final 
submission, as well as distributed collaborative writing. For example, 
Downing and Brown (1997) describe the collaborative creation of hypertexts 
by psychology students, which were published in draft on the World Wide 
Web and peer reviewed via email. Rushton, Ramsey and Rada (1993) and 
Rada, Acquah, Baker, & Ramsey (1993) reported peer assessment in a 
collaborative hypermedia environment. Good correspondence with staff 
assessment was evident, but the majority of computer science students were 
sceptical and preferred teacher-based assessment. Brock (1993) compared 
feedback from computerised text analysis programmes and from peer 
assessment and tutoring for 48 ESL student writers in Hong Kong. Both 
groups showed significant growth in writing performance. However, peer 
interaction was rated higher for helpfulness in improving content, and peer 
supported students wrote significantly more words in post-intervention 
essays. 



3.6 Summary and Conclusions on Peer Assessment 

The reliability and validity of teacher assessment is not high. That of peer 
assessment tends to be at least as high, and often higher. Reliability tends to 
be higher in relation to: the degree of advancement in the course, the nature 
of the product or performance assessed, the extent to which criteria have 
been discussed and negotiated, the nature of assessment instrumentation, the 
extent to which an aggregate judgement rather than detailed components are 
compared, the amount of scaffolding, practice, feedback and monitoring, and 
the contingencies associated with the assessment outcome. Irrespective of 
relatively high reliability, student acceptance is variable. 

In schools, research on peer assessment has focused less on reliability 
and more on effects. Students as young as grade 4 (9 years old) and those 
with special educational needs or learning disabilities have been successfully 
involved. The evidence on the effectiveness of peer assessment in writing is 
substantial, particularly in the context of peer editing. Here, peer assessment 
seems to be at least as effective in formative terms as teacher assessment, 
and sometimes more effective. The research on peer assessment of other 
learning outputs in school is as yet sparse, but merits exploration. In higher 
education, there is some evidence of impact of peer assessment on learning, 
especially in writing, sometimes greater than that of teacher assessment. In 
other areas, such as oral presentations, group skills, and professional skills, 
evidence for effects on learning is more dependent on softer data such as 
student subjective perceptions. Computer assisted peer assessment shows 
considerable promise. 

In all sectors, much further development and evaluation is needed, with 
improved methodological quality and fuller and more detailed reporting of 
studies. 



4. SELF VS. PEER ASSESSMENT 

In Higher Education, Burke (1969) found self-assessments unreliable and 
peer assessments more reliable. By contrast, Falchikov (1986) found self- 
assessments were more reliable than peer assessments. However, Stefani 
(1994) found peer assessment more reliable. Saavedra and Kwun (1993) 
found outstanding students were the most discriminating peer assessors, but 
their self-assessments were not particularly reliable. Shore, Shore and 
Thornton (1992) found construct and predictive validity stronger for peer 
than for self-evaluations, and stronger for more easily observable dimensions 
than for those requiring inferential judgement. Furnham and Stringfield 
(1994) reported greater reliability in peer assessments by subordinates and 
superiors than in self-assessments. Wright (1995) found self-assessment 
generally yielded lower marks than peer assessment, but less so in a 
structured module than in a more open ended one. Lennon (1995) found a 
high correlation between peer assessments of a piece of work (0.85), but 
lesser correlations between self and peer assessment (0.61 - 0.64). However, 
correlations between tutor and self-assessment were even lower (0.21), and 
those between tutor and peer assessment modest (0.34 - 0.55). Self- 
assessment was associated with under-marking and bunching at the median. 
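
Agreement figures of the kind reported above are correlation coefficients between the marks that two sources award to the same pieces of work. As a minimal sketch, assuming Pearson's product-moment correlation is the statistic intended (the text does not say which coefficient was used), the calculation for each pair of sources could look as follows. The marks are invented for illustration and are not data from any cited study.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two lists of marks."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two marks")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all marks are identical")
    return cov / sqrt(var_x * var_y)


# Hypothetical marks awarded to the same six pieces of work.
marks = {
    "tutor": [62, 55, 70, 48, 66, 58],
    "peer": [65, 58, 68, 55, 63, 61],
    "self": [60, 59, 61, 60, 58, 62],
}

# One coefficient per pair of assessment sources.
correlations = {
    (a, b): pearson(marks[a], marks[b]) for a, b in combinations(marks, 2)
}
for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```

A coefficient of this kind captures only whether two sources rank the work similarly. It does not reveal systematic under-marking or bunching at the median, which have to be checked separately by comparing means and spreads.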

In general, peer assessment seems likely to correlate more highly with 
professional assessment than does self-assessment, and self and peer 
assessments do not always correlate well. Of course, triangulation between 
highly correlated measures is in any event redundant, and the processes here 
are at least as important as the actual judgement. 



5. SUMMARY AND CONCLUSION RE SELF 
ASSESSMENT AND PEER ASSESSMENT 

Both self and peer assessment have been successfully deployed in 
elementary and high schools and in higher education, including with very 
young students and those with special educational needs or learning 
disabilities. The reliability and validity of assessments by professional 
teachers is often low. The reliability and validity of self-assessment tends to 
be a little lower and more variable, while the reliability and validity of peer 
assessment tends to be as high or higher. Self-assessment is often assumed to 
have meta-cognitive benefits. There is some hard evidence that it can result 
in improvements in the effectiveness and quality of learning, which are at 
least as good as gains from teacher assessment, especially in relation to 
writing. However, this evidence is still quite limited. There is more 
substantial hard evidence that peer assessment can result in improvements in 
the effectiveness and quality of learning, which are at least as good as gains 
from teacher assessment, especially in relation to writing. In other areas the 
evidence is softer. Of course, self and peer assessment are not dichotomous 
alternatives - one can lead to and inform the other. Both can offer valuable 
triangulation in the assessment process and both can have measurable 
formative effects on learning, given good quality implementation. Both need 
training and practice, arguably on neutral products or performances, before 
full implementation, which should feature monitoring and moderation. Much 
further development is needed, with improved implementation and 
evaluation quality. The following evidence-based guidelines for 
implementation are drawn from the research literature reviewed. 

Good quality of organisation is important for implementation integrity 
and consistent and productive outcomes. Important planning issues evident 
in the literature are: 

1. Clarify Purpose, Rationale, Expectations and Acceptability with all 
Stakeholders 

2. Involve Participants in Developing and Clarifying Assessment Criteria 

3. Match Participants & Arrange Contact (PA only) 

4. Provide Quality Training, Examples and Practice 

5. Provide Guidelines, Checklists or other tangible Scaffolding 

6. Specify Activities and Timescale 

7. Monitor the Process and Coach 

8. Compare Aggregated Ratings, not multiple components (PA only) 

9. Moderate Reliability and Validity of Judgements 

10. Evaluate and Give Feedback 

1. INTRODUCTION 

Assessment in science education is commonly applied to evaluate 
students, but it can be applied also to teachers and to entire schools (Nevo, 
1994; 1995). Lewy (1996) proposed that assessment be based on a set of 
tasks, including oral responses, writing essays, performing data 
manipulations with technology-enhanced equipment, and selecting a solution 
from a list of possible options. Similarly, in science education, student 
assessment is defined as a collection of information on students’ outcomes, 
both while learning is taking place - formative assessment, and after the 
completion of the learning task - summative assessment (Tamir, 1998). 
Commenting on the common image of testing and assessment, Black (1995) 
has noted: 

“Many politicians, and most of the general public, have a narrow view of 
testing and assessment. The only mode which they know and understand 
is the conventional test, which is seen as a reliable and cheap way of 
comparing schools and assessing individuals.” (p. 462). 

Since the mid eighties, educators’ awareness of the need to modify the 
traditional testing system in schools has increased throughout the western 
world (Black, 1995, 1995a). In the mid nineties, multiple choice items and 
standardized test scores were supplemented with new methods, such as 
portfolios, hands-on, performance assessment and self-assessment (Baxter & 
Shavelson, 1994; Baxter, Shavelson, Goldman, & Pine, 1992; Ruiz-Primo & 
Shavelson, 1996; Tamir, 1998). Researchers are investigating the effect of 
alternative assessment on various groups of students (Birenbaum & Dochy, 
1996; Flores & Comfort, 1997; Lawrenz, Huffman, & Welch, 2001; 
Shavelson & Baxter, 1992). Other studies investigate how teaching and 
learning in science can benefit from embedded assessment (Treagust, 
Jacobowitz, Gallagher, & Parker, 2001). 

Student assessment that is based on projects carried out by students is 
closely related to alternative assessment, as defined by Nevo (1995, p. 94): 

“In alternative assessment, students are evaluated on the basis of their 
active performance in using knowledge in a creative way to solve worthy 
problems. The problems have to be real problems.” 

Projects are becoming an acceptable means for both teaching and 
assessment. Being a school-wide endeavour, a project performance can serve 
as a means to assess not only the individual student or student team, but also 
the school that designed and carried out the project. Formative and 
summative evaluations should provide information for project planning, 
improvement and accountability (Nevo, 1983, 1994). 

In recent years, new modes of assessment have been receiving 
researchers’ attention. When new modes of assessment are applied, students 
are evaluated on the basis of their ability to solve authentic problems. The 
problems have to be non-routine and multi-faceted with no obvious solutions 
(Nevo, 1995). In science education, the term embedded assessment is used in 
conjunction with alternative assessment when referring to an ongoing 
process that emphasizes integration of assessment into teaching. Teachers 
can use embedded assessment to guide instructional decisions for making 
adjustments to teaching plans in response to the level of students’ conceptual 
understanding (Treagust et al., 2001). The combination of alternative and 
embedded assessment can potentially yield a powerful and effective set of 
tools for fostering higher order thinking skills (Dori, 2003). In this chapter, 
the term “new modes of assessment” refers to the combination of alternative 
and embedded assessment modes. 

Many deem the number and scope of the decisions made by high-level 
administrators and education experts to be excessive. To counter this 
trend, schools and teachers should be more involved in new developments in 
assessment methods (Nevo, 1995). Indeed, the American National Science 
Education Standards (NRC, 1996) indicated that teachers are in the best 
position to use assessment data to improve classroom practice, plan 
curricula, develop self-directed learners, report students’ progress, and 
research teaching practices. According to Treagust et al. (2001), the change 
from a testing culture, which is the common assessment practice, to an 
assessment culture, be it embedded or alternative, is a systemic change. Such 
a profound reform mandates that teachers, educational institutions, and 
testing agencies rethink the educational agenda and the role of assessment. 

As participants in authentic evaluation, researchers cannot set aside their 
individual beliefs and viewpoints, through which they observe and analyse 
the data they gather (Guba & Lincoln, 1989). To attenuate the bias such 
individual beliefs cause, evaluation of educational projects should include 
opinions of the various stakeholders as part of the data. 

This chapter exposes the reader to three studies, which outline a 
framework for project-based assessment in science education. The studies 
describe new modes of assessment that integrate alternative and embedded 
assessment, as well as internal and external assessment. Three different types 
of population participated in the studies: sixth-graders, high school students, 
and junior-high school teachers in Israel. In all three studies, emphasis was 
placed on assessing higher order thinking skills. The studies are summarized 
and conclusions that enable the construction of a project-based assessment 
framework are drawn. 

2. PROJECT-BASED ASSESSMENT 

Project-based curriculum constitutes an innovative teaching/learning 
method, aimed at helping students cope with complex real world problems 
(Keiny, 1995; McDonald & Czerniac, 1994). The project-based 
teaching/learning method involves both theoretical and practical aspects. It 
can potentially convey to students explicit and meaningful subject matter 
content from various disciplines, in a concrete yet comprehensive fashion. 
Project-based learning enhances higher order thinking skills, including data 
analysis, problem solving, decision-making and value judgement. 
Blumenfeld, Marx, Soloway and Krajcik (1996) argued that project-related 
tasks tend to be collaborative, open-ended and to generate problems with 
answers that are often not predetermined. Knowledge generation is 
emphasized as students pose questions, gather data and information, interpret 
findings, and use evidence to draw conclusions. Individuals, groups, or the 
whole class, can actively participate in creating unique artefacts to represent 
their understanding of natural and scientific phenomena that the project 
involves. 

Project-based learning is discussed in several studies (Cheung, Hattie, 
Bucat, & Douglas, 1996; Solomon, 1993). Through their active participation 
in the project execution process, students are encouraged to form original 
opinions and express individual standpoints. The project fosters students’ 
awareness of system complexity, and encourages them to explore the 
consequences of their own values (Zoller, 1991). 

When students engage in a project-based curriculum, the traditional 
instruments for measuring literacy do not fully convey the essence of student 
performance. Mitchell (1992b) pointed out the contribution of authentic 
assessment to the learning process, and the advantages of engaging students, 
teachers, and schools in the assessment processes. Others have proposed 
various means aimed at assessing project-based learning (Black, 1995; Nevo, 
1994; Tal, Dori, & Lazarowitz, 2000). 



3. DEVELOPING AND ASSESSING HIGHER 
ORDER THINKING SKILLS THROUGH CASE 
STUDIES 

Project-based assessment is suited to foster and evaluate higher order 
thinking skills. Resnick (1987) stated that although it is difficult for 
researchers to define higher order thinking skills, these skills could be 
recognized when they occur. Based on Costa (1985), Dillon (1990), 
Shepardson and Pizzini (1991), and using the TIMSS (Shorrocks-Taylor & 
Jenkins, 2000) taxonomy, the research projects described in this chapter 
involved both low- and high-level assignments. A low-level assignment is 
usually characterized as having a definite, clear, “correct” response, so it is 
relatively easy to assess and grade it, and the assessment is, for the most part, 
on the objective and “neutral” side. Low-level assignments require the 
students to recall knowledge and understand concepts. The opposite is true 
for high-level assignments, where the variability and range of possible and 
acceptable responses is far greater, as there is not just one “school solution”. 
High-level assignments are open-ended and require various combinations of 
application, analysis, synthesis, inquiry, and transfer skills. Open-ended 
assignments promote different types of student learning and demonstrate that 
different types of knowledge are valued (Resnick & Resnick, 1992; Wiggins, 
1989; Zohar & Dori, 2003). By nature, assessing high-level assignments is 
more demanding and challenging than assessing low-level ones, as the 
assessing teachers need to be able to embrace different viewpoints and 
accept novel ideas or original, creative responses that they had not thought of 
before. 

Performance assessment by means of case studies is a recommended 
practice in science teaching (Dori, 1994; Herried, 1994). The case study 
method, which fosters a constructivist learning environment, was applied in 
all three studies described in this chapter. The underlying principles of the 
case study method are similar to the problem based or context based method. 
Starting at business and medical schools, the case study method has become 
a model for effective learning and gaining the attention of the student 
audience. Case studies are usually real stories, examples for us to study and 
appreciate, if not emulate. They can be close-ended, demanding correct 
answers, or open-ended, with multiple solutions because the data involves 
emotions, ethics or politics. Examples of such open-ended cases include 
global warming, pollution control, human cloning and mission to Mars 
(Herried, 1997). In addition to case studies, for assessing teachers, peer 
assessment was applied, while for assessing students, self-assessment was 
applied. Peer assessment encourages group interaction and critical review of 
relative performance and increases responsibility for one’s own learning 
(Pond & Ul Haq, 1997). 

In what follows, three studies on project-based assessment are described. 
All three studies are discussed with respect to Research goal and objectives, 
Research setting, Assessment, and Method of analysis and findings. 
Conclusions are then drawn for all three studies together, which provide the 
basis for the project-based assessment framework. The subjects of studies I 
and II are students, while those of study III are teachers. The participants in 
study I are elementary school students, in study II high school students, and 
in study III junior-high school science teachers. Studies I and III investigate 
the process of project-based learning and assessment, while the added value 
of study II is its quasi-experimental design with control groups. 



4. STUDY I - ELEMENTARY SCHOOL INDUSTRY- 
ENVIRONMENT COLLABORATIVE PROJECTS 

Many science projects concern the natural environment and advance 
knowing and appreciating nature as part of science education or education in 
general (Bakshi & Lazarowitz, 1982; Hofstein & Rosenfeld, 1996). Fewer 
sources refer to the industrial environment as part of contemporary human 
environment that allows project-based learning (Posch, 1993; Solomon, 
1993). This study investigated sixth-graders who were engaged in an industry- 
environment project. The project involved teams of students, guided by 
parents, who chose, planned and manufactured industrial products, while 
accounting for environmental concerns. The community played a major role 
in influencing the theme selection, mentoring the students, and assessing 
their performances (Dori & Tal, 2000). 

4.1 Research Goal and Objectives 

The research goal was to develop and implement a project-based 
assessment system for interdisciplinary learning processes. The objectives 
were to investigate the extent to which the projects contributed to developing 
students’ higher order thinking skills (Tal et al., 2000). 

4.2 Research Setting 

The study was carried out in a community elementary school, where the 
community and students’ parents select portions from the school national 
curricula, develop local curricula and design enrichment materials. As 
Darling-Hammond (1994) noted, in schools that undergo restructuring, 
teachers are responsible for students’ learning processes and for using 
authentic tools to assess how students learn and think. 

The study was conducted over three years and included about 180 sixth- 
grade students. The main theme of the project focused on the nearby high- 
tech Industrial Park, which is located in a natural mountainous region. The 
industry-environment part of the school-based curriculum was taught 
formally over three months. The last eight weeks of that period were 
dedicated to the project, which was carried out informally, after school 
hours, in parallel with the formal learning. 

The objectives of the project included: 

• Exposing the students to real world problems and “learning-by-doing” 
activities; 

• Enabling students to summarize their learning by means of a portfolio 
and an exhibition that is open to the community at large; 

• Encouraging critical thinking and a system approach; and 

• Fostering collaborative learning and social interactions among students, 
parents and community. 

Each year, parents, students and teachers together chose a new industry 
and environment related theme. Then, the teachers divided the students into 
heterogeneous teams of 10-12 students. Within each team every student was 
expected to help and be helped (Lazarowitz & Hertz-Lazarowitz, 1998). The 
student teams, guided by volunteer parents and community experts, 
studied the scientific background related to the project themes. 
Examples of project themes included building a battery manufacturing plant 
in the neighbourhood, designing a plant for recycled paper products, and 
developing products related to the improvement of roads and public areas. 
Teachers observed the group activities and advised the mentors. Experts from 
the community met and advised team members. 







All the decisions, processes, correspondence, programs and debates were 
documented by the students and collected for inclusion in the project 
portfolio, which was presented as an important part of the exhibition. Both 
the portfolio and the exhibition are new modes of assessment, suggested by 
Nevo (1995) as a means to encourage students’ participation in the 
assessment process, and to foster interaction between the students and their 
assessors. The last stage of the project was the presentation of the teams’ 
work in an “industrial exhibition”. The exhibition was planned according to 
the ideas of Sizer (1992), who suggested that the educational exhibition serves 
as a means of meaningful learning and demonstrates various student skills in the 
cognitive, affective and communicative domains. 

4.3 Assessment 

The assessment system comprised a suite of several assessment tools: 
pre- and post-case studies; the CHEAKS questionnaire; portfolio content 
analysis; community expert assessment; and students’ self-assessment (Dori 
& Tal, 2000). 

The knowledge part of CHEAKS - Children’s Environmental Attitude 
and Knowledge Scale questionnaire (Leeming, Dwyer, & Bracken, 1995) 
was used to evaluate students’ knowledge and understanding of key terms 
and concepts. Higher order thinking skills and learning outcomes were 
assessed through the use of pre- and post-case studies, in which the students 
were required to exercise decision making and demonstrate awareness of 
system complexity. 

The community experts assessed the exhibition, and the teachers assessed 
the portfolio. The project-team score was based on the assessment of the 
team’s portfolio and its presentation in the exhibition. The portfolios were 
also sent to an external national competition. Teachers assessed the case 
studies, while the students assessed themselves. The case study and the 
self-assessment together determined the individual student score. 

We used content analysis to analyse the case studies and team 
portfolios. The purpose of the content analysis was to determine the level of 
student knowledge and understanding of key terms and concepts, the students’ 
ability to analyse industrial-environmental problems, their decision-making 
ability, and their awareness of system complexity. 

Several interviews with students, parents and teachers were conducted 
right after team meetings to raise specific questions or issues concerning 
the assessment process that needed to be discussed and clarified. This way, 
the process itself generated additional ideas for assessment criteria and 
methods. 

In developing the project-based assessment system, we focused on two 
main aspects. One aspect was whether the assessed object is the individual 
student or the entire team. The other aspect was the assessing agent, i.e., who 
does the assessment (Tal et al., 2000). The findings and conclusions were 
reflected in the final design of the system, which is summarized in Table 1. 



Table 1. Methods, agents and criteria of students' assessment 

Assessment method: Pre- and post-case studies 
Agent: Teachers 
Criteria: 
• Identifying the problem 
• Posing questions/raising hypotheses 
• Argumentation 
• Taking a stand 

Assessment method: CHEAKS 
Agent: Researchers 
Criteria: 
• Knowledge and understanding 

Assessment method: Portfolio 
Agent: Teachers and experts 
Criteria: 
• Using scientific, industrial and environmental concepts 
• Collecting and presenting scientific data 
• System thinking presented in industry-environment relationships, 
problem solving and decision making 
• Reflective thinking 
• Conceptualization - the contribution of the project to understanding 
environmental problems 

Assessment method: Exhibition 
Agent: Community experts 
Criteria: 
• Product design 
• Exhibition design 
• Manufacturing process 
• Marketing and advertisement 
• Environmental considerations 
• Team oral presentations 

Assessment method: Self assessment 
Agent: Students 
Criteria: 
• Attending team meetings 
• Listening to team members 
• Collaboration with peers 
• Initiatives within the team 
• The number of chapters in the portfolio to which the student contributed 
• Team activities to which the student contributed 
• Specific individual contribution 
• Project’s influence on student’s school life 
• Socialization difficulties within team; relation to mentors 



The case study that the students worked on concerned establishing a 
new industrial area in the Western Galilee. The Regional Council was in 
favor of the plan, as it would benefit the surrounding communities. Many 
objected to the plan. One assignment that followed this case study was as 
follows: Think of possible reasons for rejecting the plan and write them 
down. 

4.4 Method of Analysis and Findings 

In students’ responses to the case study assignment, 21 different 
hypotheses were identified and classified into two categories: 
economic/societal and environmental. Within each category, every hypothesis 
was assigned one of three possible levels: high, intermediate, or low. Three 
science education experts validated the classification of the hypotheses by 
category and level. Table 2 presents examples of the hypotheses classified by 
category and level. All four teachers who participated in the project 
established the scientific content validity of a random sample of 20% of the 
assignments. 



Table 2. Hypothesis examples by category and level for the case study assignment 

Category - Level: Example 

Economic/societal - low: New plants might cause competition. 

Economic/societal - high: It could be unprofitable and a waste of money, 
because the region is unpopulated and there are enough industrial areas. 

Environmental - low: In the western Galilee there are beautiful landscapes. 
Industry might harm them. 

Environmental - intermediate: Air and water pollution may damage wildlife. 
Microbes and diseases may affect us. 

Environmental - high: Industrial areas cause a lot of damage like air and 
water pollution by chemicals. This is dangerous to people and wildlife, 
landscapes are ruined to allow construction and the whole region changes. 



Relating the case study scores to the CHEAKS scores, Figure 1 shows 
the pre- and post-course CHEAKS knowledge and the case study scores. The 
improvement for the entire population was significant (p < 0.0001) in both 
knowledge, required in the CHEAKS, and higher order thinking skills, 
required in the case study assignments. The lack of a control group was 
compensated for by using additional assessment modes, including portfolio, 
exhibition with external reviewers, and participation in a national competition. 

Figure 1. Pre- and post-course CHEAKS knowledge and the case study scores 

The portfolios were written summaries of the teams’ work, and served as 
assessment objects for the teachers. The portfolios included a description of 
the teamwork, product selection, surveys enacted, product design, planning the 
manufacturing process, planning the marketing strategy, financial programs 
and marketing policy. They also contained reflections about the team’s 
collaborative work and its contribution to understanding of 
technological-environmental conflicts. 

To analyse the portfolios we defined five general categories. Three of 
them are listed below along with examples from the portfolios. 

• System thinking presented in industry-environment relationships: 
“Highly developed industry improves the economical situation, which, in 
turn, improves life quality.” (economic/societal consideration). 

• Reflective thinking: “We learned what team work is and how hard it is. 
We understand what taking responsibility and independent work 
means.” 

• Conceptualisation: “All these problems lead us to think about solutions, 
because if we will do nothing, reality will be worse. Having done our 
project, we are more aware of the environment... There are many 
possible solutions: regulations of the Ministry of the Environment about 
air pollution and chemical waste, preventing the emission of poisonous 
pollutants from industrial plant chimneys by raising them and by 
installing filters, using environment friendly materials, monitoring 
clearance and treatment of sewerage and solid waste...” 

The process of analysing the portfolios according to these categories 
enabled us to grade the portfolios on a five-level scale in each category. 
In the first year, three of the five teams got a high grade, while the other two 
got an intermediate grade. In the second year, two of the seven teams got a 
high grade, three teams got an intermediate grade, and two teams got a lower 
grade. 

The distinction between internal assessors and external evaluators does 
not imply any value judgment regarding the advantage of one over the other 
(Nevo, 1995). Both functions of assessment were important in our study. 
However, the external evaluation helped us in demonstrating accountability. 

External reviewers evaluated the portfolios, in addition to the school 
teachers. A national competition of student works was conducted by the 
Yubiler Institute in the Hebrew University in Jerusalem. All five portfolios 
of the pilot study (the first of three years) were submitted to the competition, 
and achieved the highest assessment of “Special Excellence”. The open 
dialogue between the internal assessment - both formative and summative, 
and the external one - formative evaluation, as presented in the portfolio 
assessment, contributed to the generalization power of our method (Tal et 
al., 2000). 

The project’s pinnacle was the industrial exhibition, where the students 
presented the products and portfolios. The presentation included the 
manufactured products, the marketing program and tools, the students' 
explanations about the environmental solutions and the teams' portfolios. 
Various experts and scholars from the community were invited to assess the 
teams’ work. They represented various domains, including industry, 
economy, education, design and art. 

The teachers and guiding parents had prepared a list of assessment criteria 
(see Table 1). The community experts interviewed each team and nominated 
the two best teams for each criterion. They were impressed by the students’ 
ability to acquire technological-environmental literacy. 

To accomplish a comprehensive project-based assessment, we elicited 
the students’ point of view through a self-assessment questionnaire. The 
questionnaire was developed as a result of negotiations with the students. 
The students suggested self-assessment criteria, including attendance in team 
meetings, listening to team members, cooperation with peers, and initiatives 
within the team and the sub-teams. In this questionnaire, most of the 
students ranked as very high the criteria of initiatives within the sub-team 
(86%), attendance in team meetings (74%), and cooperation with peers 
(67%). They indicated that the project helped them develop reflection and 
self-criticism. 

5. STUDY II - “MATRICULATION 2000” PROJECT 
IN ISRAEL: SCHOOL-BASED ASSESSMENT 

Matriculation examinations in Israel have been the dominant summative 
assessment tool of high school graduates over the last half-century. The 
grades of the matriculation examinations, along with a psychometric test 
(analogous to the SAT in the USA), are a critical factor in college and university 
admission requirements. This nationwide battery of tests is conducted 
centrally in seven or eight different courses, including mathematics, 
literature, history, English and at least one of the sciences (physics, 
chemistry, or biology). The Ministry of Education determines the goals and 
contents of each course. A national committee appointed by the Ministry is 
charged with composing the corresponding tests and setting criteria for their 
grading. This leaves the schools and the teachers with little freedom to 
modify either the subject matter or learning objectives. However, students’ 
final grade in the matriculation transcript for each course is the average of 
the school grade in the course and the pertinent matriculation examination 
grade. 
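
The averaging rule just described can be sketched in a few lines of Python. This is only an illustration: the function name and the sample grades are hypothetical, a 0-100 scale is assumed, and equal weighting is assumed because the text says "average" without giving weights.

```python
def transcript_grade(school_grade: float, exam_grade: float) -> float:
    """Return the unweighted mean of the school grade and the exam grade.

    Both grades are assumed to lie on a 0-100 scale (an assumption, not
    something the chapter states for this rule).
    """
    for grade in (school_grade, exam_grade):
        if not 0 <= grade <= 100:
            raise ValueError("grades are assumed to lie on a 0-100 scale")
    return (school_grade + exam_grade) / 2


# Hypothetical student: school grade 90, matriculation examination grade 80.
print(transcript_grade(90, 80))
```

Under this rule the school's own assessment carries half of the weight in each course, which is the lever the school-based assessment reform builds on.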

A national committee headed by Ben-Peretz (1994) examined the issue of 
the matriculation examinations from two aspects: pedagogical (quality of 
teaching, learning and assessment) and socio-cultural (the number and 
distribution of students from diverse communities eligible for the 
Matriculation Diploma). 

Addressing the socio-cultural aspect, several researchers (Gallard, 
Viggiano, Graham, Stewart, & Vigiliano, 1998; Sweeney & Tobin, 2000) 
have claimed that educational equity goes beyond the notion of equal 
opportunity and freedom of choice. The way learning is fostered should be 
examined to verify whether students are allowed to use all the intellectual 
tools that they bring with them to the classrooms. 

The Ben-Peretz Committee indicated that in their current format, the 
matriculation examinations do not reflect the depth of learning that takes 
place in many schools, nor do they measure students’ creativity. The 
Committee’s recommendations focused, among other issues, on providing 
high schools with increased autonomy to apply new modes of assessment 
instead of the nationwide matriculation examination. The school-based 
assessment would combine traditional examinations with new modes of 
assessment in a continuous fashion throughout high school, from 10th 
through 12th grade. In addition to tests, the proposed assessment methods 
included individual projects, portfolios, inquiry laboratory experiments, 
assignments involving teamwork, and article analysis. The Committee called 
for nominating exemplary schools, which would be mentored and monitored 
by experts in one, two, or three courses in each school. The school grades in 
those courses would be recognized as the standard matriculation grades. 

As a result of the Ben-Peretz Committee’s recommendations, the 
Ministry of Education launched a five-year project, titled “Matriculation 
2000.” The Project aimed at developing deep understanding, higher order 
thinking skills, and students’ engagement in learning through changes in 
both teaching and assessment methods. During the period of 1995-1999, 22 
schools from various communities participated in the project. The courses 
taught in these schools under the umbrella of the “Matriculation 2000” 
Project were chemistry, biology, English, literature, history, social studies, 
Bible, and Jewish heritage. In the liberal arts courses the most prevalent 
assessment methods were individual projects, portfolios, assignments 
involving teamwork, and presentations to peers. In the science courses, 
portfolios, inquiry laboratory experiments, assignments involving teamwork, 
concept maps, and article analysis were the most widely used assessment 
methods. 

An expert group accompanied each school, providing the teachers with 
guidance in teamwork, school-based curriculum, and new modes of 
assessment. These expert groups were themselves guided and managed by an 
overseeing committee headed by Ben-Elyahu (1995). 

5.1 Research Goal and Objectives 

The research goal was to investigate students’ learning outcomes in 
chemistry, biology and literature in the “Matriculation 2000” Project. The 
assumption was that new modes of assessment have some effect on students’ 
outcomes in both affective and cognitive domains. The research objectives 
were to investigate the attitudes that students express toward new modes of 
teaching and assessment applied in the Project and the Project’s effect on 
students’ achievements in chemistry, biology, and literature. 

5.2 Research Setting 

The research population included two groups of students in six 
heterogeneous high schools (labelled School A through School F) out of the 
22 exemplary schools that participated in the “Matriculation 2000” Project. 

The first group, which included students from 10th and 12th grades (N = 
561), served the investigation regarding the effect of the project on students’ 
affective domain. Israeli high school starts at 10th grade and ends at 
12th grade; therefore, 10th grade was the first year in which a student 
participated in the Project, while 12th grade was the last. The schools 
represented a variety of communities, academic levels, and sectors, including 
urban, secular, religious, and Arab schools. The students from these six schools 
responded to attitude questionnaires regarding new modes of teaching and 
assessment applied in the Project (see Table 3). The courses taught in these 
six schools were chemistry, biology, literature, history, and Jewish heritage. 

All the students in the Project who studied chemistry and biology took 
the courses at the highest level of 5 units, which is comparable to an Honors 
class in the US high school system and A Level in the European system. 
Most of the students who studied liberal arts courses took them at the basic 
level of 2 units, which is comparable to Curriculum II in the US high school 
system and O Level in the European system. In School D and School E, one 
science course and one liberal arts course were taught in the framework of 
the “Matriculation 2000” Project. In the other four schools, only one course 
was taught as part of the Project. 

The second group, described in Table 3, served the investigation 
regarding the effect of the project on students’ cognitive domain. In four out 
of the six experimental schools, 214 12th graders responded to achievement 
tests in chemistry (School A), biology (School E) and literature (School B 
and School C). These students served as the experimental group for 
assessing achievements. Another 162 12th grade students, who served as a 
control group, responded to identical achievement tests in chemistry, 
biology, and literature. These students were from two high schools (labelled 
G and H), which did not participate in the Project, but were at an academic 
and socio-economic level comparable to that of the experimental schools. 

To enable comparison between the experimental and control groups, the 
grades that teachers had given to the students in the participating schools 
were collected. No significant differences in chemistry and biology between 
the experimental and the control students were found. In literature, there was 
a significant difference in favour of the experimental students (X̄ experimental = 
77, X̄ control = 72; t = -2.89, p < 0.01), but since the difference was only 0.05 
(5 points out of 100) and was found only in literature, the 
experimental and the control groups were considered equivalent. 

The new modes of assessment applied in the experimental schools 
included portfolios, individual projects, team projects, written and oral tests, 
class and homework assignments, self assessments, field trips, inquiry 
laboratory activities, concept maps, scientific article reviews, and project 
presentations. These methods were integrated into the teaching throughout 
the school year and therefore constituted an embedded assessment. The most 
prevalent methods, as reported by teachers and principals, were written tests, 
class and homework assignments, individual or group projects, and scientific 
article reviews. In chemistry, the group effort was a mini-research project 
that spanned half a year. Students were required to raise a research 
question, design an experiment to investigate the question, carry it out, and 
draw conclusions from its outcomes. In biology and literature, the students 
presented individual projects to their peers in class and to expert visitors in an 
exhibition. In literature, the project included selecting a subject and either 
staging and performing it or designing a related visual artefact (Dori, Barnea, 
& Kaberman, 1999). 

To gain deeper insight into the Project setting, consider School A. In the 
middle of 10th grade, students in this school were given the opportunity to 
decide whether they wanted to elect chemistry at the Honors level. Students 
who chose this option studied in groups of 20 per class for eight hours per 
week throughout 11th and 12th grades. These students focused on 80% of 
the topics included in the national, standard Honors chemistry curriculum, 
but they were also exposed to many more laboratory activities as well as to 
scientific articles. New modes of assessment were embedded throughout the 
curriculum. The teachers’ teamwork included a weekly two-hour meeting for 
designing the individual and group projects, their theoretical and laboratory 
contents, along with additional tools and criteria for assessing these projects. 
Teachers graded the projects and scientific article reviews according to topic 
rather than class affiliation. They claimed that this process increased the 
level of reliability and objectivity of the grades. 

Table 3. Research population and research instruments. 







5.3 Assessment 

Students’ attitudes toward the Project are defined in this chapter as students’ 
perceptions of the teaching and assessment methods in the Project. These, in 
turn, were measured by the attitude questionnaires. Following preliminary 
visits to the six experimental schools, an initial open attitude questionnaire 
was composed and administered to 50 11th grade students in one of the 
experimental schools. Based on responses to this preliminary questionnaire, 
a comprehensive two-part questionnaire was compiled. Since there were 160 
items, they were divided into two questionnaires of 80 items each, with each 
question in one questionnaire having a counterpart in the other. In part A, 
items were clustered in groups of five or six. Each group of items referred to 
a specific question that represented a category. Examples of such categories 
are “What is the importance of the Project?” and “Compare the teaching 
and assessment methods in the Project with the traditional ones.” Part B 
included positive and negative items that were mixed in a random fashion 
throughout the questionnaire without specifying the central topic being 
investigated. For the purpose of analysis, negative items were reversed. All 
items in part B were later classified into the following categories: students’ 
motivation and interest, learning environment, students’ responsibilities and 
freedom of choice, and variety of teaching and assessment methods. 
Students were asked to rank each item in both parts on a scale of 1 to 5, 
where 1 was “totally disagree” and 5 was “totally agree.” 
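
The reversal of negative items can be illustrated with a minimal Python sketch. The chapter does not state the reversal formula; the sketch assumes the common convention of mirroring the rating around the midpoint of the 1-5 scale, so that 1 becomes 5 and 2 becomes 4. The item identifiers and ratings are hypothetical.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 5  # 1 = "totally disagree", 5 = "totally agree"


def reverse_score(rating: int) -> int:
    """Mirror a rating on the 1-5 scale (assumed convention: 1<->5, 2<->4)."""
    if not SCALE_MIN <= rating <= SCALE_MAX:
        raise ValueError(f"rating must be between {SCALE_MIN} and {SCALE_MAX}")
    return SCALE_MIN + SCALE_MAX - rating


def category_mean(ratings: dict, negative_items: set) -> float:
    """Average one student's ratings in a category after reversing negative items.

    ratings maps a (hypothetical) item identifier to a rating from 1 to 5;
    negative_items holds the identifiers of negatively worded items.
    """
    adjusted = [
        reverse_score(rating) if item in negative_items else rating
        for item, rating in ratings.items()
    ]
    return mean(adjusted)


# Hypothetical part B responses; item "b2" is negatively worded.
responses = {"b1": 4, "b2": 2, "b3": 5}
print(category_mean(responses, negative_items={"b2"}))
```

After reversal, a high value always expresses a favourable attitude, so positively and negatively worded items can be averaged within the same category.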

The effect of the Project on students’ performance in chemistry, biology, and 
literature was measured through a battery of achievement tests. These tests 
were administered to the experimental and control 12th grade students. 
Three science education/literature experts constructed each test and set 
criteria for its grading. Two other senior science/literature teachers, who 
were on sabbatical that year and hence did not teach any course, read and 
graded each test independently. The final test grade was computed as the 
average of the scores assigned by the two graders. In less than 5% of the 
cases, the difference between the grades each senior teacher assigned was 
greater than 10 (out of 100) points. In such cases, one of the experts, who 
participated in constructing the test and the criteria, also evaluated the test 
independently. This expert, who took into account the three grades, determined 
the final grade. 
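
The grading procedure described above can be sketched as follows. This is a hedged illustration, not the authors' actual tool: the function and parameter names are hypothetical, and because the text does not say how the expert combined the three grades, the expert's decision is simply passed in as an input.

```python
from typing import Optional

MAX_GAP = 10  # points out of 100; larger gaps go to an expert, per the text


def final_test_grade(
    grader_a: float,
    grader_b: float,
    expert_decision: Optional[float] = None,
) -> float:
    """Combine two independent grades into a final test grade.

    If the two grades differ by at most MAX_GAP points, the final grade is
    their average. Otherwise the final grade is whatever the expert decided
    after grading the test independently; that decision rule is not given in
    the text, so it is not modelled here.
    """
    if abs(grader_a - grader_b) <= MAX_GAP:
        return (grader_a + grader_b) / 2
    if expert_decision is None:
        raise ValueError("grades differ by more than 10 points; expert decision required")
    return expert_decision


# Hypothetical examples.
print(final_test_grade(80, 86))                      # graders agree closely
print(final_test_grade(60, 80, expert_decision=72))  # gap too large, expert decides
```

Keeping the expert's judgement outside the function mirrors the procedure in the text, in which the third reader rather than a formula settles the disputed cases.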

The assignments in these tests referred to a given unseen text, either a case 
study (in science) or a poem (in literature), and were categorized into 
low-level and high-level ones. 

5.4 Method of Analysis and Findings 

The scores of students’ responses to the attitude questionnaires (where the 
scale was between 1 and 5) ranged from 2.50 to 4.31 on the average per 
item. Following are several items and their corresponding scores. The item 
that scored the highest was “The assessment in the Project is based on a 
variety of methods rather than a single test”. Another high-scoring item 
(4.15) was “The Project enables self expression through creative projects and 
assignments, not just tests.” A relatively high score of 3.84 was obtained for 
the item reading “Many students take part in class discussions.” The lowest 
score, 2.50, was obtained for the item regarding the existence of a joint 
teacher-student team whose task was to determine the yearly syllabi. 
Another item that scored low (2.52) was “Students ask to reduce the number 
of weekly lessons per course.” 

Table 4 presents students’ attitude scores for the three highest scoring 
categories (formulated as questions) in part A. The highest four items in each 
category are listed in descending order, along with their corresponding 
scores. The average per category accounts for all the items in the category, 
not just the four that are listed. Therefore, the category average is 
somewhat lower than the average of the four highest items in the category. 

To find out about the types of changes participating students would like to 
see taking place in the “Matriculation 2000” Project, two complementary 
questions were posed. The first question, which appeared in one of the 
questionnaire versions, was “What would you like to increase or modify in 
the Project?” It included the items “Include more courses”, “Include more 
creative projects”, “Include more teamwork”, “Keep it as it is now”, and 
“Discontinue it in our school”. The two responses “strongly agree” and 
“agree” were classified as “for” and the two responses “strongly disagree” 
and “disagree” were classified as “against”. More than 60% of the students 
who responded to this question preferred that the Project include more 
courses and more creative projects. More than half of the students disagreed 
or strongly disagreed with the item calling for the Project to be discontinued 
in their own school. 
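
The for/against classification can be sketched in Python as below. The numeric coding is an assumption: ratings of 4 and 5 are taken to correspond to "agree" and "strongly agree", ratings of 1 and 2 to "strongly disagree" and "disagree", and the midpoint 3 is treated as neutral. The sample responses are hypothetical.

```python
from collections import Counter


def stance(rating: int) -> str:
    """Map a 1-5 agreement rating to 'for', 'against', or 'neutral' (assumed coding)."""
    if rating in (4, 5):
        return "for"
    if rating in (1, 2):
        return "against"
    if rating == 3:
        return "neutral"
    raise ValueError(f"rating must be between 1 and 5, got {rating}")


def stance_percentages(ratings: list) -> dict:
    """Return the percentage of responses in each stance for one questionnaire item."""
    if not ratings:
        raise ValueError("at least one response is required")
    counts = Counter(stance(r) for r in ratings)
    return {s: 100 * counts[s] / len(ratings) for s in ("for", "neutral", "against")}


# Hypothetical responses to a single item such as "Include more courses".
sample = [5, 4, 4, 3, 2, 1, 5, 4, 3, 4]
print(stance_percentages(sample))
```

Collapsing the five-point scale in this way is what allows statements such as "more than 60% of the students preferred" to be read directly from the item-level counts.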

The complementary question, which appeared in the other questionnaire 
version, was “What would you like to reduce or eliminate from the Project?” 
More than half of the students agreed with the item “Reduce time-consuming 
projects” while 43% agreed with the item “Eliminate all examinations”. 
About 80% were against cancelling the Project, 57% disagreed with the item 
“Do not reduce anything,” and 52% disagreed with reducing the amount of 
teamwork. 

Table 4. Students’ attitude scores for the three highest scoring categories 

Category: Effect of Project on teaching and assessment methods 
Category score: 3.95 
Highest scoring items (item score): 
• The assessment in the Project is based on a variety of methods rather 
than a single test (4.31) 
• The Project enables both individualized work and teamwork (3.95) 
• The teacher is more attentive to what students have to say (3.89) 
• The teacher serves as both advisor and tutor rather than just lecturer (3.70) 

Category: Advantages of the new modes of teaching methods 
Category score: 3.89 
Highest scoring items (item score): 
• There are many opportunities to improve students’ grades (4.15) 
• There is no pressure to cover all the course material for the 
Matriculation examination (3.95) 
• Students are more active and involved in the learning process (3.94) 
• Students are more active and involved in the assessment process (3.83) 

Category: Project importance 
Category score: 3.70 
Highest scoring items (item score): 
• The Project enables self-expression through creative projects and 
assignments, not just tests (4.15) 
• The Project reduces stress by eliminating the course’s Matriculation 
examination (4.11) 
• The Project exposes students to interesting learning approaches (3.85) 
• Students with learning difficulties get a chance of obtaining better 
scores (3.84) 



Overall, students were supportive of continuing the Project, were in favour 
of adding more courses into the Project’s framework, and preferred more 
creative projects and fewer examinations. At the same time, students were in 
favour of decreasing the workload. 



Table 5. Average scores of high-level assignments by Project course and research group 

Course       Research group   N     Mean score (X̄)   t 
Chemistry    Experimental     59    82.7 
             Control          38    63.5 
Biology      Experimental     81    64.0              4.87** 
             Control          65    55.7 
Literature   Experimental     74    61.7              01* 
             Control          59    59.2 

*p < 0.01; **p < 0.0001 



A detailed description of the various types of the assignments is provided 
elsewhere (Dori, 2003). The findings regarding students’ achievements 
showed that the experimental students achieved significantly higher scores 
than their control group peers on assignments that required knowledge and 
understanding. For example, in chemistry, experimental and control students 
scored an average of 80.1 and 57.4, respectively (p < 0.001). In high-level 
assignments (see Table 5), the differences between the two research groups 
were wider than the respective differences in knowledge-level assignments. 
Some of this wide gap can be attributed to the fact that a lower level of 
knowledge in the control group hampered their achievements on the high-level 
assignments. At any rate, this gap is a strong indication that the 
“Matriculation 2000” Project has indeed attained one of its major objectives, 
namely, fostering higher order thinking skills. This outcome is probably a 
result of the fact that students worked on the projects both individually and 
in teams, and had to discuss scientific issues that relate to complex 
daily-life problems. 

The national standardized system and the school-based assessment 
system co-exist, but for the purpose of university admission, a weighted 
score is computed, which accounts for both matriculation examination score 
(which embodies an element of school assessment) and the score of a battery 
of standard psychometric tests. 



6. STUDY III - JUNIOR-HIGH SCHOOL SCIENCE 
TEACHERS PROJECTS 

In response to changes in science and technology curricula, the Israeli 
Ministry of Education decided (Harari, 1994) to provide teachers with a 
series of on-going Science and Technology workshops of one day per week 
for a period of three academic years. This research followed two groups of 
teachers who participated in these workshops at the Department of 
Education in Technology and Science at the Technion. The workshops 
included three types of enrichment: theoretical, content knowledge and 
pedagogical content knowledge (Shulman, 1986). 

6.1 Research Goal and Objectives 

The goal of the research was to study various aspects of the new modes 
of assessment approach in the context of teachers’ professional development. 
The research objectives were to investigate how teachers developed learning 
materials that were interdisciplinary in nature, applied a system approach 
and included elements of new modes of assessment, how they viewed the 
implementation of new modes of assessment in their classrooms, and how 
these methods could be applied to assess the teachers’ deliverables (Dori & 
Herscovitz, 2000). 

6.2 Research Setting 

The research population included about 50 teachers, 60% of whom came 
from the Jewish sector and 40% from the Arab sector. About 80% of the 
population were women, 65% were biology teachers, and the rest were 
chemistry, physics, or technology teachers. About 67% of these science and 
technology teachers had over 10 years of teaching experience. 

During the three years of their professional development, the junior-high 
science teachers were exposed to several science topics in the workshops. 
Scientific, environmental, societal, and technological aspects of these topics 
were presented through laboratory experiments, case studies and cooperative 
learning. During the first two years, teachers were required to carry out three 
projects. The assignments included choosing a topic related to science and 
technology, which was not covered in the workshops. While applying 
a system approach, the teachers had to develop a case study and related 
student activities as part of the project. 

The first project, “Elements,” which was carried out toward the end of 
the first year, concerned a case study on a chemical element taken from a 
popular science journal. The teachers received the article and were asked to 
adapt it to the students’ level and to design a student activity that would 
follow the reading. 

The second project, “Air Pollutants”, was carried out during the middle 
of the second year. Here, teachers were required to search for an appropriate 
article that discussed this topic and dealt with a scientific/technological 
issue. Based on the article they selected, they had to design a case study 
along with an accompanying student assignment. 

The third and final project, which started toward the end of the second 
year, included preparing a comprehensive interdisciplinary teacher-chosen 
subject, designing a case study and student activities, and implementing it in 
their classes. The first, second and third projects were done individually, in 
pairs and in groups of three to four teachers, respectively. The third project 
was taught in the teachers’ own classrooms, and was accompanied by peer 
and teacher assessment, as well as students’ feedback. 

6.3 Assessment 

The individual teacher, peers and the workshop lecturer assessed the first 
projects. The objective of this assessment was to experience the use of new 
modes of assessment. In the second project, each pair presented their work 
orally to the entire group, and the pair, the other pairs and the lecturer 
assessed it. The third project was presented by each group in an exhibition, 
and was evaluated using the same criteria and the same assessment scheme 
and method as the two previous projects. 

Setting criteria for assessing the projects preceded the new modes of 
assessment that the science teachers applied. Groups of 3-4 teachers set these 
criteria after reading their peers’ project portfolios. Six criteria were finally 
selected in a plenary session. Some of these criteria, such as 
design/aesthetics, and originality/creativity, were concerned with project 
assessment in general, and were therefore also applicable to students’ 
projects. Other, more specific criteria related to the assessment of the teacher 
portfolios, and included interdisciplinarity, suitability for the students, and 
variability of the accompanying activities. 

The originality/creativity criterion was controversial. While most groups 
proposed a criterion that included these elements, it was apparent that 
objective scoring of creativity is by no means a straightforward matter. One 
group therefore suggested that this criterion would add a bonus to the total 
score. Teachers were also concerned about the weight assigned to each 
criterion. The decision was that for peer assessment during the workshops, 
all criteria would be weighted equally, while for classroom implementation, 
the teacher would have the freedom to set the relative weights after 
discussing it with his/her students. 
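
The weighting decision described above can be illustrated with a minimal sketch. It assumes that each criterion receives a numeric score and that the total is a weighted average; the function name, the 1-3 scale, the example scores and the classroom weights are all hypothetical and do not come from the study. Only the five criteria named in the text are used.

```python
def weighted_total(scores, weights=None):
    """Return the weighted average of per-criterion scores.

    scores  -- dict mapping criterion name to a numeric score
    weights -- dict mapping criterion name to a weight; None means
               all criteria count equally (the workshop rule)
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    if set(weights) != set(scores):
        raise ValueError("weights and scores must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical scores for one project on an assumed 1-3 scale.
scores = {
    "design/aesthetics": 2,
    "originality/creativity": 3,
    "interdisciplinarity": 3,
    "suitability for the students": 2,
    "variability of activities": 1,
}

# Peer assessment during the workshops: all criteria weighted equally.
peer_total = weighted_total(scores)

# Classroom implementation: weights agreed with students (hypothetical values).
class_weights = {
    "design/aesthetics": 1,
    "originality/creativity": 2,
    "interdisciplinarity": 3,
    "suitability for the students": 2,
    "variability of activities": 2,
}
class_total = weighted_total(scores, class_weights)
```

Treating the equal-weight case as the default keeps the workshop rule and the classroom rule in one function, so the same project can be compared under both schemes.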

6.4 Method of Analysis and Findings 

The criteria proposed by the teachers, along with additional new ones, 
were used to analyse both the case study and the accompanying activities 
that teachers had developed. 

The analysis of the case study was based on its level of interdisciplinary 
nature and system approach, as well as the suitability to students’ thinking 
skills. Two science and environmental education experts validated the 
classification and analysis of the case studies and the related activities. 
Analysing the case studies teachers developed, we found that they went 
through a change from viewing only their own discipline to a system 
approach that integrates different science disciplines. The level of suitability 
of the case study to the target population increased accordingly. The 
statistical mode of the level of interdisciplinarity (number of disciplines 
integrated) in the case studies increased from one in the first project (with 
frequency of 50%) to two in the second project (50%) and to three in the 
third (80%). In parallel, the suitability for students increased from low (42%) 
through intermediate (37%) to high (60%). 
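
As a minimal sketch of how a statistical mode and its frequency, as reported above, can be obtained, the following counts how often each level occurs and reports the most common one with its share. The list of levels is hypothetical, chosen only so that the result matches the reported first-project figures (mode of one, frequency of 50%); it is not the study's raw data, and the function name is an assumption.

```python
from collections import Counter


def mode_with_frequency(values):
    """Return (most common value, its share of all values)."""
    if not values:
        raise ValueError("values must not be empty")
    # most_common(1) returns the single (value, count) pair with the highest count.
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Hypothetical data: number of disciplines integrated in each of ten
# first-project case studies.
project_1_levels = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
mode, share = mode_with_frequency(project_1_levels)
print(f"mode = {mode}, frequency = {share:.0%}")  # mode = 1, frequency = 50%
```

If two levels are tied, this sketch reports the one that appears first in the list; a real analysis would need to state how ties are handled.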

For the assessment of the student activities we used four categories: (1) 
interdisciplinarity; (2) variety; (3) relation to the case study; and (4) 
complexity (Herscovitz & Dori, 2000). The scores for each criterion in each 
category for assessing the student activities that followed the case study are 
presented in Table 6. 



Table 6. Categories and scores for assessing case study student activities 

Category                     Points  Score for each criterion 

Level of interdisciplinary   1       One domain only (scientific, societal, etc.) is involved. 
nature                       2       Two domains are involved. 
                             3       Three domains are involved. 

Variability                  1       All activities are of the same type; teacher questions to the student. 
                             +1      An extra point is given for each additional activity (e.g., experiment, movie, concept map, class discussion, field trip, debate). 

Relation to the case study   1       Low; superficial treatment, which does not touch the essence of the problem; activity has little to do with the case study. 
                             2       Intermediate; reasonable treatment and relation to the case study; no deep treatment of the problem raised in the case study. 
                             3       High; deep, serious treatment of the case study; gradual, logical construction that leads to profound student understanding. 

Complexity                   1       Low-order thinking skill; the answer is contained in the case study. It requires knowledge and understanding only. 
                             2       High-order thinking skill; the answer is, at most, partially contained in the case study. It requires analysis and synthesis. 
                             3       Very high-order thinking skill; the answer is not contained in the case study. It requires value judgement, system approach, argumentation, or assessment. 
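
The point scheme of Table 6 can be sketched as a small scoring function. This is an illustration under stated assumptions rather than the authors' procedure: the class and function names are hypothetical, the interdisciplinarity score is capped at three because the table defines no higher level, and adding the four category scores into one total is an assumption, since the text does not spell out how totals were aggregated.

```python
from dataclasses import dataclass


@dataclass
class ActivitySet:
    domains: int         # number of domains involved (1 to 3 in Table 6)
    activity_types: int  # number of distinct activity types (at least 1)
    relation: int        # 1 = low, 2 = intermediate, 3 = high
    complexity: int      # 1 = low-order, 2 = high-order, 3 = very high-order


def score_activities(a):
    """Return the four category scores of Table 6 and their sum."""
    if a.domains < 1 or a.activity_types < 1:
        raise ValueError("domains and activity_types must be at least 1")
    if a.relation not in (1, 2, 3) or a.complexity not in (1, 2, 3):
        raise ValueError("relation and complexity must be 1, 2 or 3")
    scores = {
        # One point per domain; capped at 3 (assumption, see lead-in).
        "interdisciplinarity": min(a.domains, 3),
        # 1 point for a single type, plus 1 for each additional type.
        "variability": 1 + (a.activity_types - 1),
        "relation": a.relation,
        "complexity": a.complexity,
    }
    # Summing the categories is an assumption about aggregation.
    scores["total"] = sum(scores.values())
    return scores


# Hypothetical example: two domains, three activity types, high relation
# to the case study, high-order thinking required.
example = score_activities(
    ActivitySet(domains=2, activity_types=3, relation=3, complexity=2)
)
```

Note that variability is the only open-ended category, so under this reading a project with many activity types can exceed the 1 to 3 range of the other categories.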

Figure 2. Total assessment of case study activities in the three projects 

Figure 2 shows a clear trend of improvement in the total score of case study 
activities across the three projects. The most frequent score range in 
project 1 was 6 to 15 (with frequency of 80%), in project 2 it was 16 to 25 
(50%), and in project 3 it was 26 to 35 (70%). In project 3, 20% of the works 
were ranked in the range of 52 to 85. No work in project 1 or 2 was ranked in 
this range. The grades for the variability of the accompanying activities 
increased as well. 

Using the criteria they had set, the teachers assessed their own projects, as 
well as their peers’, providing only verbal remarks without assigning 
numerical values. Teachers’ resistance to performing peer assessment in 
class decreased as the projects progressed. For most teachers, the 
criteria-setting process was a new, inspiring experience. One teacher noted: 
“I had heard about student and 
peer assessment, but I had no idea what it entails and how it should be 
implemented. Now I know that I need to involve my students in setting the 
criteria.” Another teacher indicated: “I had hard time explaining to my 
students why one student portfolio got a higher score than another. Thanks 
to the discussion during the workshop, the issue of criteria setting became 
clear... Involving students in setting the criteria and in assessing their own 
work as well as their peers’, fosters involvement and enhances the 
collaborative aspect of their work.” A third teacher said that the discussion 
with her peers about the new modes of assessment contributed to her 
pedagogical knowledge: “I realized that I need to add knowledge questions 
in student activities following the case study, so that the low academic level 
students can participate in group discussions and demonstrate their point of 
view.” 

The teachers were enthusiastic about these new modes of assessment, in 
which teachers and students may play the role of equal partners. Some 
expressed readiness to implement this approach in their classrooms and 
indeed invited the lecturer to observe them in action. 

Teachers developed learning materials with an increased level of system 
approach and suitability to their students. They incorporated some new 
elements of assessment into the activities that followed the case studies. The 
assessment of teachers’ projects has shown that the activities they proposed 
towards the end of the second year increased in complexity and required the 
students to exhibit higher order thinking skills, such as argumentation, value 
judgment, and critical thinking. It should be noted that these results involve 
several factors, including individual vs. group processes, long-term (two- 
year) learning, assessment methods, and relative criteria weight in peer- and 
classroom assessments. Future studies may be able to address each of these 
factors separately, but, as experience has shown, this is hardly feasible in 
education. 



7. DISCUSSION 

The project-based assessment framework has emerged as a common thread 
throughout the three studies described here. This framework is holistic, in 
the sense that it touched upon domains, activities and aspects that both 
students and teachers experience. Researchers and educators attribute many 
benefits to project-based assessment schemes. Among these benefits are the 
generation of valid and reliable information about student performance, 
provision of formative functions, and promotion of teacher professionalism 
(Black, 1998; Nevo, 1994; Dori & Tal, 2000; Worthen, 1993). As Sarason 
(1990) wrote, the degree of responsibility given to students in the traditional 
classroom is minimal. They are responsible only in the sense that they are 
expected to complete tasks assigned by teachers and do so in ways the 
teachers have indicated. They are not responsible to other students. They are 
solo learners and performers responsive to one adult. The opposite is true for 
the project-based framework, into which the new modes of assessment were 
woven in a way that constitutes a natural extension of the learning itself 
rather than an external add-on. In establishing the assessment model, we 
adopted lessons of Nevo (1995) in school-based evaluation, who noted that 
outcomes or impacts should not be the only thing examined when evaluating 
a program, a project, or any other evaluation object within school. 
Departing from the traditional school life-learning environment, the project- 
based approach resembles real life experience (Dori & Tal, 2000; Mitchell, 
1992a). Students are usually enthusiastic about learning in project-based 
settings; they apply inquiry skills and deal with complexity while using 
methods of scaffolding (Krajcik, Blumenfeld, Marx, Bass, Fredricks, & 
Soloway, 1998). In the project-based learning described in this chapter, 
students were responsible for their own learning, teachers oversaw student 
teamwork, and community stakeholders were involved in school curriculum 
and assessment. Participants eagerly engaged in the learning process with 
emotional involvement, resulting in meaningful and long-term learning. In a 
school environment like this, higher order thinking skills and autonomous 
learning skills develop to a greater extent than in traditional learning settings 
(Dori, 2003). 

Black (1995, 1998) argued that formative assessment has much more 
potential than usually experienced in schools, and that it affects learning 
processes in a positive way. The assessment types presented in this chapter 
as the project-based framework increase the variety of assessment models. 
One advantage of this type of educational orientation is that students and 
teachers collaborate to create a supportive learning environment, which is in 
line with knowledge building in a community of learners (Bereiter & 
Scardamalia, 1993). The project-based curriculum and its associated 
assessment system required time and effort investment by the teachers and 
students alike. Yet they accepted it, as they recognized the value of the 
assessment as an on-going process, integrated with the learning. 

The project-based assessment framework that has emerged from the studies 
presented here is multi-dimensional in a number of ways: 

• The assessed objects are both the individual student (or teacher) and the 
team; 

• External experts, teachers, and students carry out the assessment; 

• The assessing tools are case studies, projects, exhibition, portfolios and 
self-assessment questionnaires. 

Despite its complexity, this assessment was meaningful and suitable to a 
variety of population types. 

In all the three project-based studies, students achieved scores in the high- 
level assignments that were lower than those achieved in the low-level 
assignments. This is consistent with the findings of other researchers (Harlen, 
1990; Lawrenz et al., 2001) who showed that open-ended assignments are 
more difficult and demanding, because they measure more higher order 
thinking skills and because students are required to formulate original 
responses. Open-ended, high-level assignments provide important feedback 
that is fundamentally different in nature from what can be obtained from 
assignments that are defined as low-level ones. The high-level assignments, 
developed as research instruments for these studies, required a variety of 
higher order thinking skills and can therefore serve as a unique diagnostic 
tool. 

Following the recommendation of Gitomer and Duschl (1998) and of 
Treagust et al. (2001), in the “Matriculation 2000” Project teachers improved 
students’ learning outcomes and shaped curriculum and instruction decisions 
at the school and classroom level through changing the assessment culture. 
The reform that took place in the 22 high schools is a prelude to a transition 
from a nationwide standardized testing system to a school-based assessment 
system. Moreover, teachers, principals, superintendents, and Ministry of 
Education officials who were engaged in this Project became involved in 
convincing others to extend the Project boundaries to additional courses at 
the same school and to additional schools in the same district. 

The study that involved teachers has shown that projects can serve as a 
learning and assessment tool not only for students but also for teachers. 
Hence, incorporating project-based assessment is recommended for both pre- 
and in-service teacher workshops. Introducing teachers to this method will 
not only serve as a means of evaluating the teachers, but will also 
expose them to new modes of assessment and encourage them to implement 
these modes in their classes. Relevant stakeholders in the Israeli Ministry of 
Education have recognized the significance of the research findings of these 
studies and others carried out in other academic institutes. They realize the 
value of project-based learning and the new modes of assessment framework. 
However, economic constraints have been slowing down its adoption on a 
wider scale. 

The main limitation of these studies stems from the scale-up problem. It is 
difficult to implement project-based assessment with large numbers of 
learners. If we believe that assessment modes as described in this chapter 
ought to be applied in educational frameworks, we need to find efficient 
ways to alleviate teachers' burden of following, documenting and grading 
students' project portfolios. The educational system should take care of 
adequate compensation arrangements that would motivate teachers to carry 
on these demanding assessment types even after the initial enthusiasm has 
diminished. Pre-service teachers can be of significant help in this regard. 

The findings of these studies clearly indicate that project-based assessment, 
when embedded throughout the teaching process, has the unique advantage 
of fostering and assessing higher order thinking skills. These conclusions 
warrant validation through additional studies in different settings and in 
various countries. 



1. INTRODUCTION 

It is widely accepted that to an increasing extent, successful functioning 
in society demands more than being capable of performing the specific tasks 
a student learned to perform. Society is characterized by continuous, 
dynamic change. This has led to major shifts in the conception of the aim of 
education. Bowden & Marton (1998) expressed this change as the move 
from “learning what is known” towards “educating for the unknown future”. 
As society changes rapidly, there will be a growing gap between what we 
know at this moment and what will be known in the coming decade. Within 
this context, what is the sense of students consuming encyclopaedic 
knowledge? Bowden & Marton advocate that cognitive, meta-cognitive and 
social competencies are required, more than before. Birenbaum (1996, p. 4) 
refers to cognitive competencies such as problem solving, critical thinking, 
formulating questions, searching for relevant information, making informed 
judgments, efficient use of information, etc. The described changes are 
taking place with increasing moves towards what is called powerful learning 
environments (De Corte, 1990). They are characterized by the view that 
learning means actively constructing knowledge and skills based on prior 
knowledge, embedded in contexts that are authentic and offer many 
opportunities for social interaction. Feltovich, Spiro, and Coulson (1993) use 
the concept of understanding to describe the main focus of the current 
instructional and assessment approach. They define understanding as 
“acquiring and retaining a network of concepts and principles about some 
domain that accurately represents key phenomena and their 
interrelationships and that can be engaged flexibly when pertinent to 
accomplish diverse, sometimes novel objectives.” (p. 181). Examples of 
powerful learning environments include problem-based learning, project- 
oriented learning and product-oriented learning. 

Mien Segers 

M. Segers et al. (eds.), 

Optimising New Modes of Assessment: In Search of Qualities and Standards, 119-140. 

© 2003 Kluwer Academic Publishers. Printed in the Netherlands. 

With respect to assessment, a variety of new modes of assessment are 
implemented, with performance assessment and portfolio-assessment as two 
well-known examples. Characteristic of these new modes of assessment is 
their “in context” and “authentic” nature. Authentic refers to the type of 
cognitive challenges in the assessment. Assessment tasks are defined as 
authentic when “the cognitive demands or the thinking required are 
consistent with the cognitive demands in the environment for which we are 
preparing the learner” (Savery & Duffy, 1995, p. 33). However, authentic 
refers not only to the nature of the assessment tasks, but also to the role of 
assessment in the learning process. The assessment tools are tools for 
learning. Teaching and assessment are both seen as tools to support and 
enhance students’ transformative learning. Assessment is a valuable learning 
experience in addition to allowing grades to be assigned (Birenbaum & 
Dochy, 1996; Brown & Knight, 1994). In this respect, the formative function 
of assessment is stressed. Finally, the shift from teacher-centred to student- 
centred education has its impact on the division of responsibilities in the 
assessment process. To a growing extent, the student is an active participant 
in the evaluation process, who shares responsibility, practices self- 
evaluation, reflection and collaboration. 

As the criteria for assessment have changed, questions have been raised 
about the conceptions of quality criteria. As described in chapter 3, 
conceptions of validity and reliability encompassed within the instruments of 
assessment, have changed accordingly. One of these changes is the growing 
emphasis on the consequential validity of assessment. However, until now, only 
a few studies have explicitly addressed this issue in the context of new modes 
of assessment. This chapter will present the results of validity studies of the 
OverAll Test, indicating the added value of investigating the consequential 
validity of new modes of assessment. 



2. THE OVERALL TEST AND THE PROBLEM- 
BASED LEARNING ENVIRONMENT 

The OverAll Test is a case-based assessment instrument, assessing 
problem-solving skills. With the implementation of the OverAll Test, it was 
expected that the alignment between curriculum goals, instruction and 
assessment would be enhanced. It was expected that by assessing students’ 
skills in identifying, analysing, solving and evaluating novel authentic 
problems, it would stimulate the problem-solving process in the tutorial 
groups. Now, it has been implemented in a wide variety of faculties at 
different Belgian and Dutch institutions of higher education. Most of them 
have adopted a problem-based or project-based instructional approach and 
they have attempted to optimise the learning effects of their instructional 
innovations by changing the assessment practices. 

2.1 The Problem-Based Learning Environment 

One of the aims of problem-based learning is to educate students who are 
able to analyse and solve problems (Barrows, 1986; Engel, 1991; Poikela & 
Poikela, 1997; Savery & Duffy, 1995). Therefore, the learning process is 
initiated and guided by a sequence of a variety of problem tasks, which 
cover the subject content (Nuy, 1991). During the subsequent years of study, 
these problem situations become more complex and diverse in the activities to 
be undertaken by the students (e.g. from writing advice for a manager 
simulated in a problem to discussing the proposal in a live setting with a 
manager from a specific firm). Working in a small group setting (10 to 12 
students), guided by a tutor, students analyse the problem presented and 
discuss the relevant aspects of the problem. They formulate a set of learning 
objectives based on their hypotheses about possible conceptualisations and 
solutions. These objectives are the students’ starting point for 
processing the subject matter in study books. In the next group sessions, the 
findings of the self-study activities are reported and discussed. Their 
relevance for the problem and for novel but similar problems is evaluated. 

The curriculum consists of a number of instructional periods, called 
blocks. Each of them has one central theme, operationalized in a set of 
problems. 

2.2 The OverAll Test 

The OverAll Test is a case-based assessment instrument. The cases are 
not of the key-feature format (Des Marchais & Dumais, 1993) but describe 
the problem situation in an authentic way, i.e. with elements that are both 
relevant and irrelevant to the problem situation. This implies that for most 
cases, in order to 
understand and analyse the problem situation, knowledge from different 
disciplines has to be mastered. The test items require students to define the 
problem, analyse it, contribute to its solution and evaluate the solutions. 
They do not ask students to tackle the entire problem situation presented in 
each case but refer to its critical aspects. The cases present novel problems, 
asking the students to transfer the knowledge and skills they acquired during 
the tutorials and to demonstrate their understanding of the influence of 
contextual factors on problem analysis as well as on problem solving. For 
some test items, students are asked to argue their ideas based on various 
relevant perspectives. 

For generalizability purposes (Shavelson, Gao, & Baxter, 1996), the 
OverAll Test items refer to a set of cases. 

In summary, the OverAll Test can be described by a set of characteristics 
(Segers, 1996): 

• it is a paper-and-pencil test; 

• it is part of the final examination; 

• it presents a set of authentic cases that are novel to the students (this 
means they were not discussed in the tutorial groups); 

• the test items require the students to identify, analyse, solve 
and evaluate the problems underlying the cases; 

• the cases are multidisciplinary; 

• two item formats are used: multiple-choice questions and open-ended 
questions; 

• it has an open-book character; 

• the test items refer to approximately seven different authentic cases 
(each about 10 to 30 pages) that are available for the students from the 
beginning of the instructional period; the test items related to the cases 
are given at the moment of test administration. 

Figure 1 presents examples of OverAll Test items. The test items refer to 
the Mexx and Benetton case (30 pages). The case study presents the history 
and recent developments in the fashion companies Mexx and Benetton. Main 
trends within the European clothing industry are described. Mexx Fashion 
and Benetton are portrayed through their organisational structure, the 
product profile and market place, the business system, the corporate culture 
and some actual facts and figures. 

The first question refers to the different viewpoints on management. 
Memorisation of the definition is not sufficient to answer the OverAll Test 
item. Students have to interpret the case and select the relevant information 
for this test item. On the basis of a comparison of this information with the 
conceptual knowledge of the different viewpoints on management, they have 
to deduce the answer. The second OverAll Test item refers to the concept of 
vertical integration. It requires students to take a set of mental steps to reach 
the solution of the problem posed. For the first part of the question (a), these 
can be schematised as follows: 

1. Define the concept of corporate strategies. 

2. Select the relevant information for the Mexx company as described in 
the case study. 

3. Confront it with the definition of the different possible strategies. 

4. Select the relevant information for Benetton as described in the case 
study. 

5. Apply/choose a definition of the different possible strategies. 

6. Match the relevant information of both cases with the chosen definition 
of the strategies. 

7. Define for each company its strategy. 

8. Compare both strategies by going back to the definition of the strategies 
and the relevant information in the case study. 



Case Mexx/Benetton 

(This case is approx. 35 pages) 



The case study presents the history and recent developments in the fashion company 
Mexx. Main trends within the European clothing industry are described. Mexx 
Fashion and Benetton are pictured by their organizational structure, the product 
profile and marketplace, the business system, the corporate culture and some actual 
facts and figures. 

Question 1 

true/?/false Mexx’s corporate culture and philosophy is consistent with the systems 
viewpoint on management. 

(False, it is consistent with behavioural viewpoint) 

Question 2 

Benetton’s and Mexx’s corporate strategies are quite different. More specifically, 
there are two main differences. 

a. Identify these two main differences in corporate strategies. Illustrate your 
answer with examples mentioned in the case. 

b. What are the advantages of Benetton’s corporate strategy compared to Mexx’s 
approach? 



Figure 1. Examples of OverAll Test Questions 

For the second part (b), students have to evaluate. Therefore, they have to 
take some extra mental steps: 

1. Understand the conditions to be efficient and effective for the different 
strategies. 

2. Select the relevant information on the conditions for both companies. 

3. Interpret the actual conditions by comparison with those studied in the 
textbooks. 

This example illustrates that the OverAll Test measures whether students are 
able to retrieve the relevant concepts (models, principles) for the problem. 
Furthermore, it measures whether they can use these instruments for solving 
the problem. The items measure whether the knowledge is usable (Glaser, 
1990), that is, whether students 
know “when and where” (conditional knowledge). In short, the OverAll Test 
measures to what extent students are able to analyse problems and contribute 
to their solution by applying the relevant instruments. Figure 2 presents a 
second example of OverAll Test items based on an article on Scenario 
Planning. 



Question 1 

In his introduction, Schoemaker (1995) compares the method of scenario planning 
with other approaches such as contingency planning, sensitivity analysis and 
computer simulations. Hellriegel & Slocum (1996, textbook) give a similar 
comparison of three methods: scenarios, the Delphi technique and simulation. They 
stress that an overlap exists between these approaches, and indeed, it is not difficult 
to imagine how to use techniques like Delphi and simulation within Schoemaker’s 
framework of scenario planning. 

True/?/false The Delphi technique fits better in phase 3 of scenario planning 
(identifying basic trends) than in phase 9 of the scenario planning 
(develop quantitative models). 

Question 2 

The correlations in Table 3, Part B on p. 31 (Schoemaker, 1995) are nearly all 
positive, which makes the case rather specific. Give a new example of scenario 
planning by solving the following tasks: 

1. Write down a hypothetical correlation matrix, having the same size as that in 
table 3, but the number of entries with +, - and 0 more equally distributed. 

2. Derive a scenario profile (as figure 1) from this correlation matrix, thereby 
giving special attention to the existence of both positive and negative 
correlations. If necessary, make additional assumptions in order to find the 
profile. Start with one single scenario. 

3. Derive a second scenario profile, assuming this second scenario to be the 
reverse scenario of the first (reverse in its literal meaning; if the first scenario is 
something like recession, then the second is that of high economic activity). 

4. Give a description in words of the consistency requirements one has to observe
in tasks 2 and 3.

5. Give an interpretation of the outcomes of the scenario profiles you constructed 
yourself. Schoemaker ends up with one scenario that performs best in all 
possible aspects, whilst the third scenario is the worst one, again in all possible 
aspects. Is the same true for the case you designed? 



Figure 2. Examples of OverAll Test questions 

The Schoemaker test items ask students to address the problem from different
perspectives. Students have to integrate knowledge from the disciplines of
Statistics and Organisation within the context of scenario planning:
knowledge from both disciplines has to be used to tackle the scenario
planning problem.



3. THE VALIDITY OF THE OVERALL TEST 

In 1992, a research project was started to monitor different quality
aspects of the OverAll Test. One of the studies addressed the instructional
validity of the OverAll Test. In response to the changing notions of validity
and to different problems observed within the faculty, a study was
conducted in 1999 to measure the consequential validity of the OverAll Test.
Both studies are presented in the following sections.

3.1 The Instructional Validity of the OverAll Test 

McClung (1979) introduced the term “instructional validity” as opposed
to the term “curricular validity”. To answer the question “Is it fair to
ask the students to answer the assessment tasks?”, assessment developers mostly
check the curricular validity, i.e. the match between the assessment content 
and the formal curriculum. The description of the formal curriculum is 
derived from the curricular objectives and the curriculum content. The 
formal curriculum is mostly expressed in the blueprint of the curriculum, 
describing the educational objectives in terms of content and level of 
comprehension. The assessment objectives are expressed in the assessment
blueprint, which mirrors the curricular blueprint. In interpreting test
validity based on the match between the curriculum and the assessment 
blueprint, it is assumed that the instructional practice reflects the formal 
curriculum. Many studies indicate that this assumption may be questioned 
(Calfee, 1983; De Haan, 1992; English, 1992; Leinhardt & Seewald, 1981; 
Pelgrim, 1990). The operational curriculum, defined as what is actually 
learned and taught in the classrooms, can significantly differ from the formal 
curriculum as described in textbooks and syllabi. This is especially the case
in learning environments where, more than in teacher-centred settings,
students are expected to formulate their own learning goals. How sure can
we be that the test is fair to students who vary to some extent in the learning
goals they pursue?

Only a few studies in student-centred learning environments, such as 
problem-based learning, addressed this issue. Dolmans (1994) investigated 
the match between formal and operational curriculum and the relation
between attention given to learning goals during the tutorial groups and the
students’ assessment outcomes. She concluded there was an overlap of 
64.2% (s = 26.7) between both curricula. The correlation between the time
spent on the core concepts and the scores on the test items referring to these
concepts appeared to be significant but weak (r = .22, p < .05, n = 94). Probably it is the
quality, more than the quantity, of the time spent on the core concepts that 
affects test scores. 
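
The reported correlation can be checked for plausibility. The sketch below is not part of Dolmans' analysis: it assumes a two-sided test and uses the Fisher z approximation to estimate the p-value for r = .22 with n = 94, and it reports the proportion of variance the two measures share. The function name `fisher_z_p_value` is illustrative.

```python
import math

def fisher_z_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for a Pearson r via the Fisher z transform."""
    if not -1.0 < r < 1.0 or n <= 3:
        raise ValueError("need -1 < r < 1 and n > 3")
    # Under the null hypothesis, z is approximately standard normal.
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

r, n = 0.22, 94             # values reported in the text
p = fisher_z_p_value(r, n)
shared_variance = r ** 2    # proportion of variance the two measures share
print(f"p is about {p:.3f}; shared variance is about {shared_variance:.1%}")
```

Under these assumptions the p-value falls below .05 while the shared variance stays under 5%, which illustrates how a correlation can be statistically significant and still weak.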

3.1.1 Research Method 



3.1.1.1 Procedure

The formal curriculum was described by analysing the textbooks, syllabi
and tutorial manuals. The analysis resulted in a list containing more than 500
detailed topics for each period. Domain specialists screened this extensive
list to arrive at a workable one. They constructed a hierarchical schema of
the topics and retained only the highest hierarchical levels of the networks
of subjects in the final version. For example, the concepts “entry
strategies”, “export”, “licensing” and “joint ventures” were all
included in the draft version, whereas the final version included only the
concept of “entry strategies”. Thus, the list of central concepts was reduced to
147 topics for the Marketing and Organization period and 136 topics for the
Macro-economics period. The curricular validity was examined by comparing
the formal curriculum with the test of the first instructional period: the list
of concepts was compared with the list of objectives of the OverAll Test.

To examine the instructional validity, two questionnaires were developed
based on the lists of concepts. The questionnaires are modified versions of
the Dolmans Topic Checklist (1994). The first Topic Checklist (TOC 1)
consisted of the 147 topics in the disciplines Marketing and Organization. The
second (TOC 2) presented the 136 Macroeconomics topics.

Students were asked to indicate, by marking each topic or leaving it
unmarked, whether it had been discussed in their tutorial groups. In order to
gain some insight into the quality of the time spent on the topic, the second
Topic Checklist (TOC 2) on Macroeconomics contained two additional
questions. First, students had to indicate the level of comprehension they believed
they had reached. For every respondent, the number of topics mastered on 
each of the three levels of comprehension was counted. These levels were 
defined as the level of definition, the level of comprehension and the level of 
analysis. Mastery on the level of definition indicates the student is (only) 
able to reproduce the meaning of the concept as formulated in the textbooks. 
Comprehension of the topic implies that the student is able to define the 
concept in his own words, describe its relevance and its relation to other 
concepts. To master a topic on the level of analysis would require the student
to be able to apply the concepts when confronted with a problem to be
analysed. The staff members who developed the course were asked to 
indicate for each topic the intended level of comprehension. Finally, students 
were asked if a topic received much, moderate, or not much attention during 
the tutorial meetings. 

3.1.1.2 Sample

The sampling procedure employed in the study was that of the quota 
sample. The group of first year students was, for organizational reasons, 
divided into four groups. Two groups had their meetings in the morning, two 
groups in the afternoon. Students were equally selected from these four 
groups. For the TOC 1, 34 students participated voluntarily, for TOC 2, 45 
students. 

3.1.2 Results 

As the results in Table 1 indicate, there is a substantial overlap
between the topics planned for study by the staff and the topics indicated by the
students as having been the subject of discussion and study during the
instructional period. Table 1 indicates that on average 87% of the topics of
TOC 1 and 77.4% of the topics of TOC 2 have been the subject of study (RT).
Another study investigating the match between the formal and the operational
curriculum in a problem-based learning setting (Dolmans, 1994) showed an
overlap of 64.2%.
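
A minimal sketch of how such an overlap figure can be derived from Topic Checklist responses. The data layout (one set of marked topic identifiers per student), the function name and all values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev

def overlap_percentages(marked_by_student, n_topics):
    """Percentage of formal-curriculum topics each student marked as discussed."""
    return [100.0 * len(marked) / n_topics for marked in marked_by_student]

# Hypothetical responses: topic ids marked as discussed by three students
# on a checklist of 10 topics (the real checklists had 147 and 136 topics).
responses = [
    {1, 2, 3, 4, 5, 6, 7, 8},
    {1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
    {1, 2, 3, 5, 8, 9},
]
pct = overlap_percentages(responses, n_topics=10)

rt_mean = mean(pct)            # mean share of recognized topics (RT)
rt_sd = stdev(pct)             # sample standard deviation across students
nrt_mean = 100.0 - rt_mean     # mean share of not recognized topics (NRT)
print(f"RT = {rt_mean:.1f}% (s = {rt_sd:.1f}), NRT = {nrt_mean:.1f}%")
```

The overlap is computed per student first and averaged afterwards, which is why a standard deviation across students can be reported next to the mean.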

Students perceived they had mastered on average 47% of the topics of 
TOC 2 on the level of comprehension, i.e., that they were able to explain in 
their own words the meaning of the topics, their relevance and their relation 
to other concepts. For an average of 31% of the topics, students stated they 
were able to use these topics for the analysis of problems (level of analysis). 
For 22% of the topics, on average, students indicated they had mastered 
them on the level of definition, i.e., that they were able “only” to reproduce 
the definition. The correspondence with the aims of the staff is considerable. 

Concerning the curricular validity, for the OverAll Test, 11% and 15% of
the test items refer to topics that were not part of the formal curriculum. This
means they were missing from Topic Checklist 1. Comparing the topics that had
or had not been discussed (RT/NRT) with the content of the test items, none of
the topics indicated as not having been the subject of discussion
by more than 29% of the students (percentile 25) were part of the OverAll
Test. This result suggests high instructional validity of the OverAll Test.

Table 1. The Degree of Overlap Between the Formal and the Operational Curriculum

Variables        Mean                               Standard Deviation    n
RT1*             87%                                17.33                 34
NRT1             12%                                15.67                 34
RT2              77.4%                              12.64                 45
NRT2             22.6%                              15.67                 45
Definition       22.1% (student), 20.6% (staff)     21.24                 45
Comprehension    47% (student), 40.4% (staff)       22.64                 45
Analysis         30.9% (student), 39% (staff)       16.58                 45

* RT: Recognized Topics (1 = TOC 1, 2 = TOC 2)
  NRT: Not Recognized Topics



Additionally, for TOC 2, the more topics students indicated as having “received
much attention during the meetings”, the higher their OverAll Test scores
(r = .40*). On the other hand, the more topics students indicated as having
“received moderate attention during the meetings”, the lower their OverAll
Test scores (r = -.32*). Probably, students acquired partial knowledge through
the informal exchanges they had about these topics. This partial knowledge might
impede instead of enhance successful problem analysis. There was virtually no
correlation between the number of topics that received not much attention and
the test scores (r = .01).
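
The correlations above relate, per student, the number of topics placed in one attention category to that student's test score. A minimal sketch of the computation follows; the function `pearson_r` is written out for illustration, and the five data points are invented, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: per student, the number of topics marked as having
# received much attention, and the OverAll Test score.
much_attention_counts = [20, 35, 50, 42, 28]
test_scores = [48, 55, 70, 61, 57]
r_much = pearson_r(much_attention_counts, test_scores)
print(f"r = {r_much:.2f}")
```

Repeating the calculation with the counts for the “moderate” and “not much” categories would give the other two coefficients.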

3.1.3 Conclusion 

The instructional validity study suggests there is a substantial degree of
overlap between the formal and the operational curriculum, in terms of core
concepts as well as in terms of the level of comprehension. Although the
students followed different learning paths during the instructional period,
they perceived that they had addressed the core concepts of the formal
curriculum at the expected level of comprehension. It can be expected that in
this case, with the assessment blueprint deduced from the curricular blueprint,
the assessment reflects instruction. The results of the validity study confirm
this hypothesis.

However, informal discussions with staff and students raised doubts
about the match between instruction and assessment. It had been expected that
the OverAll Test, as a mode of assessment aligned with the features of the
learning environment, would have a positive effect on learning and teaching.
These doubts led to the study of the consequential validity of the OverAll Test.

3.2 The Consequential Validity of the OverAll Test 

The growing implementation of new modes of assessment has been 
influenced by the high expectations teachers had about their positive effects. 
Ramsden (1988, p. 24) argued, “Perhaps the most significant single 
influence on students’ learning is their perception of assessment”. It was 
expected that a change of the assessment from a constructivist perspective 
would enhance students’ learning. Within educational measurement, the shift 
in conceptualisations of learning and assessment has influenced the 
developments in the philosophy of validity, as Moss (1992) describes. With
the changing conceptions of validity in educational measurement, there is
growing interest in the multidimensionality of the concept of validity
and the relevance of those dimensions for new modes of assessment (Moss, 
1992). Linn, Baker and Dunbar (1991) describe the relevance of the 
consequences of assessment. They define those consequences as “the 
intended and unintended effects of assessment on the ways teachers and 
students spend their time and think about the goals of education” (p. 17).
However, as Thomson and Falchikov (1998) explain, one element of
assessment research that has received less attention than others is
students’ perceptions of assessment. Only a few studies have addressed this
issue (e.g. Gibbs, 1999; Sambell, McDowell & Brown, 1997; Thomson & Falchikov,
1998). The results of these studies as regards new modes of assessment are
conclusive. Gibbs (1999) showed in a series of case studies that “students are 
tuned in to an extraordinary extent to the demands of the assessment system 
and even subtle changes to methods and tasks can produce changes in the 
quantity and in the nature of learning outcomes out of all proportion to the 
scale of change in assessment” (p. 52). Based on a series of interviews with
students experiencing new modes of assessment, the research of Sambell et
al. (1997) revealed the impact of assessment on learning: “Their perceptions
of poor learning, lack of control, arbitrary and irrelevant tasks in relation to 
traditional assessment contrasted sharply with perceptions of high quality 
learning, active student participation, feedback opportunities and meaningful 
tasks in relation to alternative assessment” (p. 365). 

In both studies, there is not much information on the learning
environment of which the assessment is part. One can hypothesize that, in
the case of new modes of assessment, an alignment between instruction and
assessment will enhance the positive effects of assessment on learning.
Additionally, the studies describe perceptions of assessment methods that
are relatively new for students and staff. Sambell et al. (1997) indicate that
for 5 of the 13 cases, the assessment approach was being used for the first
time. In only 4 of the 13 cases was the assessment method considered
typical for that course. The observed effects of the implementation
of new modes of assessment are probably due in part to the innovative
character of these assessments.

The case described in this chapter differs from the previous studies in
two respects. First, at the time of the study, the OverAll Test had been
in use for 8 years. Students and teachers were very familiar with its
characteristics and purposes. Second, it was developed according to the
features of the problem-based learning environment of which it is part. The
instructional validity study presented earlier indicated an acceptable alignment
between the formal and operational curriculum and the OverAll Test. It was
expected that the OverAll Test would enhance learning as a construction of
knowledge in order to identify, analyse, solve and evaluate novel, authentic
problems.

3.2.1 Research Method 



3.2.1.1 Procedure 

An evaluation questionnaire was developed to measure students’ 
perception of different quality aspects of the learning environment. The 
questionnaire items were based on interviews with staff members asking for 
their expectations about the students’ study processes. For all the items in the
questionnaire, the staff expected average scores on the 5-point Likert scale
higher than 3.5, with low standard deviations (<0.5). As a measure of
reliability, i.e. internal consistency, Cronbach’s alpha coefficient was
calculated. Across the different surveys, the coefficient varied between .50 and
.63. This indicates a moderate reliability of the instrument.
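
For readers unfamiliar with the coefficient, the sketch below computes Cronbach's alpha with the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the answer matrix are hypothetical; the chapter does not report item-level data.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least 2 respondents and 2 items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every respondent must answer every item")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical 5-point Likert answers: 5 respondents x 4 items.
answers = [
    [4, 5, 3, 4],
    [3, 4, 2, 4],
    [5, 5, 4, 3],
    [2, 3, 3, 2],
    [4, 4, 5, 5],
]
alpha = cronbach_alpha(answers)
print(f"alpha = {alpha:.2f}")
```

Alpha rises as the items covary more strongly relative to their individual variances, so a questionnaire that covers quite different study activities can be expected to yield moderate rather than high values.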

In order to complement the student surveys and to gain a deeper
insight into students’ perceptions of the OverAll Test, semi-structured
interviews were held with groups of students. For reasons of
between-respondent triangulation, we additionally interviewed a group of
teachers. As Sambell et al. (1997, p. 355) indicate: “The act of soliciting the
varying perspectives of the range of people involved in the assessment process
was crucial in building up a rich, fully contextualized picture of the phenomenon
- alternative assessment - under investigation.” The aim was to encourage
the students and the teachers to talk freely and openly about their
experiences with the OverAll Test, with interviewers providing initial
stimuli, using an interview schedule (Powney & Watts, 1987). The
semi-structured interview schedules of Sambell et al. (1997) were adapted to the
Maastricht situation and used for the student interviews as well as for the
teacher interviews. In contrast to earlier questionnaires and
interview schedules measuring the effect of conventional assessment on
learning, the Sambell et al. interview schedules were developed to explore
student perceptions of the consequential validity of new modes of
assessment. The interviews focused on “what students understood to be
required, how they were going about the assessment task(s), and what kind
of learning they believed was taking place” (Sambell et al., 1997, p. 355).
These questions were the core of the OverAll Test study presented here.

Based on the rationale of the Focus Group method, the interviews were 
conducted in a group session and not in a one-on-one session. The 
participants in the student groups and in the teacher group were asked to 
discuss their perceptions of different aspects of the OverAll Test. By asking 
the stakeholders to discuss these issues with peers, the Focus Group Method
intends to generate richer information than is the case with individual
interviews (Churchill, 1996).

The kind of data obtained included students’ and staff members’ detailed 
descriptions of how they perceived the OverAll Test. In addition, the data 
reflect the students’ and staff’s more general reflections and ideas about 
assessment. 

The interview data for analysis were grouped in themes and structures. 
The validity of the interpretations rests upon careful reflection and 
discussions with researchers not involved in the interviews. Therefore, we 
discussed the interpretations with a team of 3 educational scientists and one 
economist. Additionally, we searched for confirmatory and contrary 
evidence in order to strengthen the development of interpretations by a round 
table discussion with a random sample of the interviewed staff members and 
students. 

3.2.1.2 Sample

For the survey, all students taking the test were asked to answer
the questionnaire. The response rate varied between 50% and 70%. For the
Focus Group, the interviews were conducted with 5 student groups and one
staff member group (n = 8). Within a random sample of 5 tutorial groups (n =
12), students were asked to volunteer. In total, 48 students participated.
Teachers who had been engaged in the OverAll Test for many years were asked
to cooperate. In total, 8 staff members participated.

3.2.2 Results



3.2.2.1 The Student Survey 

The results revealed some notable phenomena. Table 2 summarizes the
average scores of the first year students (n=100, academic year 1998-1999)
on a set of items from the questionnaire (5-point Likert scale). The data have
been consistent over the past years.



Table 2. The Average Scores of Students on the OverAll Test Students’ Perceptions
Questionnaire (OverAll Test 1, 1998-1999, Faculty of Economics and Business
Administration)

Item                                                              M (s)

Self-study
• I read the cases                                                4.5 (1.0)
• I checked the concepts I did not understand                     4.1 (1.0)
• I indicated the key concepts in the cases                       3.9 (1.1)
• I made schemas of the content of the cases                      2.8 (1.2)
• I checked the topics of the cases in the learning materials
  of the module                                                   3.7 (1.1)
• I tried to formulate figures and tables in the case in my
  own words                                                       3.6 (0.8)
• When choices or decisions were made in the case, I tried
  to find arguments from the learning materials of the
  module                                                          3.2 (0.9)
• I tried to explain the data presented in the case on the
  basis of the learning materials of the module                   3.8 (0.7)
• I tried to understand the cases by asking why-questions         3.4 (1.0)
• I made notes from the analyses I made                           3.3 (1.1)
• I worked on the cases together with peers                       2.6 (1.7)
• How many hours did you spend working on the cases?              27.0 (16.1)

The OverAll Test
• The way of working in the tutorial group fits the way of
  questioning in the OverAll Test                                 2.5 (1.0)
• The test questions were more difficult than expected            4.1 (0.8)
• For most questions I had to read the cases                      2.8 (1.0)
• I had enough time to answer the questions                       2.9 (1.4)



It is clear from the results in Table 2 that only reading the cases is a
common activity of the students. This finding was quite disappointing for the
staff. For many students, analysing the cases in depth was not part of their
work on the cases. Although students had 2 weeks free of tutorial
meetings in order to work on the cases, they spent less than half that time on
this activity. During the early years, the staff tried to motivate and guide the
students more by giving more concrete study guidelines together with the
cases. This did not change the results of the questionnaire significantly. It
was especially the answer to the question “The way of working in the
tutorial group fits the way of questioning in the OverAll Test” which
surprised the staff. Although the earlier validity study suggested instructional
validity (see above), students did not seem to perceive a match between the
processes in the tutorial group and the manner of questioning in the OverAll
Test. Particularly because working on problems is the main process within
problem-based learning environments, the staff considered this student
perception a serious issue.

3.2.2.2 The Semi-Structured Interviews

Different issues were addressed. First, the students and the staff 
expressed their views on the concept of the OverAll Test and its relationship 
with other assessment instruments. Second, views on the relation between 
instructional practice and assessment practices were expressed. Third, views 
on the way of working during the self-study period were considered. Fourth, 
views on how the assessment practices can be optimised were explored. 

3.2.2.2.1 The Concept of the OverAll Test and the Knowledge Test

The students and the staff described the OverAll Test in terms of two
characteristics: the level of competence measured and the domain
questioned.

The students explained the goal of the OverAll Test as measuring the 
application of knowledge. As Sebastian said: “in the OverAll Test you have 
to use knowledge in practice”. Thomas explained it as follows: “The 
OverAll Test asks you to use knowledge; you need to do more than for the 
Knowledge Test. For the Knowledge Test you read the textbooks and study 
it by heart. For the OverAll Test, you have to relate things; you have to cope 
with the context where knowledge is to be used. The OverAll Test is 
building knowledge, the Knowledge Test is memorising.” Stephanie, a tutor, 
used the term “the linking of knowledge to practice.” 

Concerning the domain characteristic, the students perceived the OverAll 
Test as asking the students to connect theories. Tobias said: “The OverAll 
Test is a summary of two blocks (instructional periods). It checks if you 
remembered the basics of two blocks.” Rene, a tutor, described the OverAll
Test as “a kind of test, where you progressively measure to what extent
students are able to use knowledge from different instructional periods.”

3.2.2.2.2 Match between Instruction and Assessment

The staff as well as the students indicated that, in theory, the transfer of 
the problem-solving skills from the tutorial group to the self-study period 
and the OverAll Test should be “a natural step”, as Peter said. However, they
experienced the tutorial group as placing too much stress on the reproduction of
what was read in the literature. The literature was supposed to be a tool for
explaining, analysing and solving the problem posed. In practice, the
problem was often used only as a starting point for going to the literature. From
that moment on, the literature became a goal instead of a tool. The students
experienced this process as highly influenced by the skills of the tutor. The
staff formulated three reasons for the “reproductive” functioning of some
tutorial groups: the skills of the tutor, the number of concepts
addressed in the instructional period, and the motivation of the students.
Mark, a staff member, said: “It is the task of the tutors to help the students to
understand the context of what they learn. The motivation of the student 
influences the extent to which this happens. At the time, there is too much 
reproduction of knowledge in the tutorial group and too little creativity. This 
can also be seen in the students’ answers on the OverAll Test items. They 
reproduce knowledge that is relevant for the problem analysis questions, but 
they do not link the theory to the case. But, can we require creativity from 
the students if we did not pursue it during the tutorials?” Maarten, a 
colleague added: “Too many subjects are planned within the curriculum. 
There is no time left for discussion and for going back to the problem.” 

The students asked for more exercises and discussion. Kurt, a student,
stated this point very clearly: “How should you be able to analyse things if you
have to deal with 19 chapters within 6 weeks? There the problem-based
learning system fails”. He added, “Within the tutorials, the graphs were less
complex and mostly they were drawn in the book and you had to interpret 
them. In the OverAll Test, you had to draw the graphs yourself.” The 
students also indicated that they had problems with the novelty of the 
problems. During the tutorials, the learning took place based on a problem. 
Discussions of the key issues based on similar problems or problems with 
slight variations on the starting problem, seldom took place. “During the 
OverAll Test, you suddenly have to deal with novel problems. Sometimes 
those problems questioned in the OverAll Test present variations which are 
difficult to work with,” the students concluded. 

Another aspect discussed was the feedback on what students know.
Because some of the tutorial groups ended up only summarizing the
literature without discussing in depth its relevance to the problem, there was
no real feedback on students’ understanding of the key issues. This feedback
occurs when the students start discussing what they found in the literature and
its relation to the problem as presented in the case. Improving the tutorial
means improving this feedback function, according to the students and the
staff. Additionally, both expressed the need for feedback on the test results.
That is when the real learning starts.

In order to improve feedback and to get the first year students acquainted 
with the OverAll Test, in the middle of the first block, students received a 
novel case tackling issues studied during the past three weeks based on 
different problem tasks. The students were asked to answer a set of problem 
analysis questions based on this case, which were similar to OverAll Test 
items. The answers to these questions as well as the problem-analysis 
process were discussed within the tutorial group. The students found this
exercise relevant and they encouraged this way of giving feedback and
practising. However, because the OverAll Test came at the end of the next
six-week block, they perceived it as “not primarily relevant for the moment”.

These perceptions suggest that because of curricular as well as tutor 
problems, the tutorial groups did not sufficiently succeed in relating 
knowledge to practice. This is mirrored in the problems students face when 
analysing the cases that are subject of the OverAll Test. Additionally, the 
feedback function of the tutorial group failed to a certain extent. The 
feedback function was overwhelmed by the reproduction of a large amount 
of content matters relevant to the starting problem. Moreover, the explicit 
feedback moment during the block was experienced as only one single 
moment of exercising, inappropriately planned in the curriculum. The
students expressed the need for such exercises in flexible problem analysis
to be part of all blocks.

3.2.2.2.3 Students’ Activities

The staff expressed the feeling that the students do not work enough 
during the self-study period. In some cases, the students only read the 
articles. The students agreed that for some of them, reading the cases, 
sometimes, was the only activity. Other students formed a small tutorial 
group themselves and discussed the cases. The students in these groups
indicated they experienced this process as very effective. “It drives you to
think critically about what you found,” David said.

The students mentioned different reasons for not working full-time 
during the self-study period. Sometimes, they experienced the cases as not 
interesting. “Sometimes, especially when the cases are long, you do not
know what to do with it,” Dirk said. “For the OverAll Test, you need to do
more. But what this ‘doing more’ exactly means is not clear.” Some
students said that they did not know how to start working. If they had read
the cases and checked the relevant issues in the literature, what more should
they do?

This feeling of being unsure how to handle the cases mirrors the
problems of group functioning mentioned earlier. As in some tutorial
groups, the process stalls once the content matter has been reproduced. The
return from theory to practice, in order to better understand practice, was
missing.

Finally, the students referred to the minor weight of the OverAll Test in 
the final examinations. If it had more weight, they surely would put more 
energy into it. 

3.2.2.2.4 Optimising Assessment Practices

The concept and the relevance of the OverAll Test are largely accepted. 
The students as well as the staff members indicated the OverAll Test is an 
inherent aspect of the problem-based learning curriculum. It is instructional 
practice that fails to some degree. 

In line with the problems expressed, students as well as staff members
asked for more feedback, more time for discussion and for using knowledge
as a tool for problem-solving, and more skilled tutors.



4. CONCLUSION 

As the core goals and features of the learning environment changed 
during the last decade, questions were posed as to what extent and in what 
sense assessment of students’ performances should be adapted to these new 
directions in learning and instruction. This has led to the expanding 
implementation of new modes of assessment. The development of the 
OverAll Test in a problem-based learning environment is an example of 
these changes in instructional and assessment approach. 

The shift in conceptions of learning, teaching and assessment, together 
with the expansion of the implementation of new modes of assessment, has 
led to shifting conceptions of validity in educational measurement. Although 
many debates have been going on, only a few research studies address the 
quality of new modes of assessment from this new perspective. 

The present study indicates the value-added of looking for evidence of 
multiple dimensions of validity of new modes of assessment. Concerning the 
match between instruction and assessment, it seemed that the OverAll Test 
measured the concepts at the level of comprehension, perceived by the 
students as part of instruction. However, one important concern was the 
intended and unintended effects of the OverAll Test on the way students and 
teachers spend their time and think about the goals of education and, in 
particular, assessment. The results of a yearly student survey and of the 
semi-structured interviews with groups of students and teachers have led to 
the following conclusions. Students as well as teachers agreed largely on 
what they understood to be required by the OverAll Test: students have to 
apply knowledge; they have to cope with the context where knowledge is to 
be used. The OverAll Test is about building knowledge. They both 
recognised that the OverAll Test requires accessing previously acquired 
knowledge in a new set of contexts. Taking into account these 
characteristics, students as well as the staff perceived the OverAll Test as 
relevant, as an inherent aspect of the problem-based learning environment. 
Concerning how the students were going about the assessment tasks, it 
seemed that the students spent less than half of the time planned for working 
on the novel cases. The students indicated that to a large extent, they did not 
get further than reading the case and reading about the relevant concepts in 
the literature. This was mirrored in the way the students and the teachers 
perceived the learning that was taking place in the tutorial groups. In some 
cases, the tutorial groups ended up summarizing the theory that was relevant 
for the starting problem. There was no in-depth analysis of the starting 
problem and no transfer of the general concepts derived from this starting 
problem to similar problems. For some tutorial groups, using the theory as a 
tool for analysing and solving the problem was not the core practice. 
Students as well as staff referred to the overloading of the blocks and the 
problem of some tutors who were not skilled enough to stimulate the group to 
discuss, analyse and use the knowledge as a tool for solving a variety of 
problems. This overly “reproductive” functioning of the tutorial group was an 
obstacle to achieving in-depth feedback on the learning process. 

4.1 Recommendations for Assessment and Instruction 

The findings reported indicate the importance of the alignment of 
instruction and assessment. In this respect, Biggs (1996) stresses the 
importance of constructive alignment, where students’ performances that 
represent a high cognitive level (understanding), are nominated in the 
objectives and thus used to systematically align teaching methods and the 
assessment. Under these conditions, new modes of assessment such as case- 
based assessment and the OverAll Test can influence student learning. 
However, the subjective learning environment, the way students perceive the 
learning environment, seems to play a mediating role. Although these new 
modes of assessment encourage study behaviour such as “building 
knowledge”, the quality of the learning environment as perceived by the 
students plays a crucial role in the extent to which students really engage in 
these kinds of study behaviours. The research results presented here indicate 
that, when interpreting the results of consequential validity studies of new 
modes of assessment, the perceived learning environment has to be taken 
into account. 

It can be concluded that the evaluation of the assessment practices leads to 
recommendations for improving instruction. It seems that, as students 
express, the burden is on the instructional practice and not on the assessment 
instrument. 

4.2 Recommendations for Future Research 

It is commonly stressed that “the tail wags the dog”: the assessment 
influences to a large extent what and how students learn. However, the 
findings on the consequential validity of the OverAll Test, presented in this 
chapter, indicate that the relation between assessment and learning is more 
complex. The learning environment, as perceived by the students, can 
mediate the effect of the assessment practices on learning and teaching. 
Researchers investigating the consequential validity of new modes of 
assessment should take this into account. 

In edumetrics, validity is stressed as a crucial quality indicator for new 
modes of assessment. The consequential validity study presented here is 
conducted within this edumetric framework. However, additional research is 
necessary to measure the various quality aspects of the OverAll Test. Four 
additional research questions can be formulated. 

Is the OverAll Test fair to all the students? Especially with respect to 
performance assessments, this question is often raised (Bond, 1995). Within 
the case presented, students from different nationalities with different 
learning experiences and different learning styles are working in small 
groups. To what extent is the OverAll Test fair for different subpopulations 
with respect to their nationality, prior experiences, learning styles and the 
tutorial group attended? 

The notion of predictive validity asks for a procedure of determining the 
extent to which this assessment instrument predicts accurately the 
performance of the students in their subsequent study careers (Benett, 1993). 
This issue can be operationalized as: do high performers on the OverAll Test 
perform better on the projects they do in graduate courses, and is there an 
effect on their entrance on the labour market? 

Finally, the question remains of whether the results of the studies 
reported are case-specific. Research in other settings, with other curricula 
and with other student populations where the OAT is implemented, can 
indicate how generalizable the results obtained in this study are. Comparing 
the consequential validity of the OverAll Test in various curricula and 
learning environments can indicate under which conditions (i.e. learning 
environments) the OverAll Test optimally stimulates learning as a 
construction of knowledge in order to identify, analyse, solve and evaluate 
novel, authentic problems. 
Only if we expand and reformulate our view of what counts as human intellect will we be able 
to devise more appropriate ways of assessing it and more effective ways of educating it. 
Howard Gardner 



1. INTRODUCTION 

Since the late 1980s, public education in North America has been 
shifting to standards- or outcomes-based and performance-oriented systems. 
Within such systems, the most basic purpose of all education is student 
learning, and the primary purpose of all assessment is to support that 
learning in some fashion. The assessment reform that began in the 1980s in 
North America has had numerous impacts. Most significantly, it has changed 
the way educators think about students’ capabilities, the nature of learning, 
the nature of quality in learning, as well as what can serve as evidence of 
learning in terms of classroom assessment, teacher assessment and large- 
scale assessment. In this context, the use of portfolios as a mode of 
assessment has gained a lot of interest. This chapter will explore the role of 
assessment in learning and the role portfolios might play. Research evidence 
of the qualities of portfolios for enhancing student learning is presented and 
discussed. 



2. THE ROLE OF ASSESSMENT IN LEARNING 

Learning occurs when students are “thinking, problem-solving, 
constructing, transforming, investigating, creating, analysing, making 
choices, organising, deciding, explaining, talking and communicating, 
sharing, representing, predicting, interpreting, assessing, reflecting, taking 
responsibility, exploring, asking, answering, recording, gaining new 
knowledge, and applying that knowledge to new situations” (Cameron, 
Tate, Macnaughton, & Politano, 1998, p. 6). The primary purpose of student 
assessment is to support this learning. Learning is not possible without 
thoughtful use of quality assessment information by learners. This is 
reflected in Dewey’s (1933) “learning loop,” Lewin’s (1952) “reflective 
spiral,” Schön’s (1983) “reflective practitioner,” Senge’s (1990) “reflective 
feedback,” and Wiggins’s (1993) “feedback loop.” Education (K-12 and 
higher education) tends to hold both students and teachers responsible for 
learning. Yet, if students are to learn and develop into lifelong, independent, 
self-directed learners, they need to be included in the assessment process so 
the “learning loop” is complete. Reflection and assessment are essential for 
learning. In this respect, the concept of assessment for learning, as opposed to 
assessment of learning, has emerged. 

For optimum learning to occur students need to be involved in the 
classroom assessment process. When students are involved in the assessment 
process they are motivated to learn. This appears to be connected to choice 
and the resulting ownership. When students are involved in the assessment 
process they learn how to think about their learning and how to self-assess - 
key aspects of meta-cognition. Learners construct their own understandings; 
therefore, learning how to learn - becoming an independent, self-directed, 
lifelong learner - involves learning how to assess and learning to use 
assessment information and insights to adjust learning behaviours and 
improve performance. 

Students’ sense of quality in performance and expectations of their own 
performance are increased as a result of their engagement in the assessment 
process. When students are involved in their learning and assessment they 
have opportunities to share their learning with others whose opinions they 
care about. An audience gives purpose and creates a sense of responsibility 
for the learning which increases the authenticity of the task (Davies, 
Cameron, Politano, & Gregory, 1992; Gregory, Cameron, & Davies, 2001; 
Sizer, 1996). 

Students can create more comprehensive collections of evidence to 
demonstrate their learning because they know and can represent what 
they’ve learned in various ways to serve various purposes. This involves 
gathering evidence of learning from a variety of sources over time and 
looking for patterns and trends. 

The validity and reliability of classroom assessment are increased when 
students are involved in collecting evidence of learning. The collections are 
likely to be more complete and comprehensive than if teachers alone 
collect evidence of learning. Additionally, this increases the potential for 
instructionally relevant insights into learning. Teachers employ a range of 
methods to collect evidence of student learning over time. When evidence is 
collected from three different sources over time, trends and patterns can 
become apparent. This process has a history of use in the social sciences and 
is called triangulation (Lincoln & Guba, 1984). As students learn, evidence 
of learning is created. One source of evidence is products such as tests, 
assignments, students’ writings, projects, notebooks, constructions, images, 
demonstrations, as well as photographs, video, and audiotapes. They offer 
evidence of students’ performances of various kinds across various subject 
areas. Observing the process of learning includes observation notes 
regarding hands-on, minds-on learning activities as well as learning journals. 
Talking with students about their learning includes conferences, written self- 
assessments, and interviews. Collecting products, observing the process of 
learning, and talking with students provides a considerable range of evidence 
over time. 

Taking these critical success factors of learning into account, portfolio as 
a mode of assessment poses unique challenges. 



3. PORTFOLIO AND ITS CHARACTERISTICS 

Gillespie, Ford, Gillespie, and Leavell (1996) offer the following 
definition: “Portfolio assessment is a purposeful, multidimensional process 
of collecting evidence that illustrates a student’s accomplishments, efforts, 
and progress (utilising a variety of authentic evidence) over time.” (p. 487). 
In fact, portfolios are so purposive that everything that defines a portfolio 
system: 

1. what is collected; 

2. who collects it; 

3. how it is collected; 

4. who looks at it; 

5. how they look at it; and 

6. what they do with what they see. 

are all determined first by the purpose for the portfolio. 

Consider, for example, a portfolio with which one will seek employment. 
While there must be no duplicity in the evidence presented, it would seem 
perfectly acceptable, even expected, that the candidates will present 
themselves in the best possible light. What is most likely to find its way into 
such a portfolio is a finished product - and often the best product at that. On 
the other hand, consider a portfolio with which a student reveals to a trusted 
teacher a completely balanced appraisal of his or her learning: strengths, 
certainly, but also weaknesses as well as the kinds of processes the learner 
uses to produce his or her work. This portfolio is likely to have a number of 
incomplete efforts, some missteps, and some products that reveal current 
learning needs. This is not the sort of portfolio with which one would be 
comfortable seeking employment. The point is a simple one: while they 
appear similar in many respects, portfolios are above all else purposive and 
everything about them derives from their desired purpose. This is why some 
frank discussion about purpose at the outset of developing a portfolio system 
is essential. Often, when teachers feel blocked about some decision about 
their portfolio process, the answer is apparent upon remembering their 
purpose. 

There is no single best purpose for portfolios. Portfolios can be 
used to show growth over time (e.g. Elbow, 1986; Politano et al., 1997; 
Tierney, Carter, & Desai, 1991), to provide assessment information that 
guides instructional decision-making (e.g. Arter & Spandel, 1992; Gillespie 
et al., 1996; LeMahieu & Eresh, 1996a), to show progress towards 
curriculum standards (e.g. Biggs, 1995; Gipps, 1994; Frederiksen & Collins, 
1989; Sadler, 1989a), to show the journey of learning including process and 
products over time (e.g. Costa & Kallick, 2000; Gillespie et al., 1996), as 
well as to gather quantitative information for the purposes of 
assessment outside the classroom (e.g. Anson & Brown, 1991; Fritz, 2001; 
Millman, 1997; Willis, 2000). The strength of portfolios lies in the range and 
comprehensiveness of evidence and in the variety and flexibility with which 
they address purpose (Julius, 2000). 

Portfolios are used successfully in different ways in different classrooms. 
Portfolios are generally defined in the literature in terms of their contents 
and purpose - an overview of effort, progress or performance in one or 
several subjects (e.g. Arter & Spandel, 1992; Gillespie et al., 1996; Herman, 
Aschbacher, & Winters, 1992). There are numerous examples of student 
portfolios developed to show learning to specific audiences in different 
areas. They are being used in early childhood classes (e.g. Potter, 1999; 
Smith, 2000), with students who have special needs (e.g. Law & Eckes, 
1995; Richter, 1997), and in elementary classrooms for science (e.g. Valdez, 
2001), writing (Howard & LeMahieu, 1995; Manning, 2000), and 
mathematics (Kuhs, 1994). Portfolios in high schools were used initially in 
performance-based disciplines such as fine arts, then in writing classes, and 
have now expanded to be used across many disciplines such as science 
education (e.g. Reese, 1999), academic and beyond (e.g. Konet, 2001), 
chemistry classes (e.g. Weaver, 1998), English classes (e.g. Gillespie et al., 
1996), and music education (e.g. Durth, 2000). There is a growing body of 
research related to electronic portfolios (e.g. Carney, 2001; Quesada, 2000; 
Young, 2001; Yancey & Weiser, 1997). Portfolios are also being used in 
teacher-education programs and in higher education more broadly (e.g. 
Kinchin, 2001; Klenowski, 2000; McLaughlin & Vogt, 1996; Schonberger, 
2000). 

There is a range of evidence students can collect. Also, since there are 
different ways for students to show what they know, the assessment 
information collected can legitimately differ from student to student (see for 
example Anthony, Johnson, Mickelson, & Preece, 1991; Gardner & Boix- 
Mansilla, 1994). Collecting the same information from all students may not 
be fair and equitable because students show what they know in different 
ways (e.g. Gardner, 1984; Lazear, 1994). When this assessment information 
about learning is used to adjust instruction, further learning is supported. 
Evidence of learning will also vary depending on how students represent 
their learning. Portfolios uniquely provide for this range of expression of 
learning. When they are carefully developed, they do so with evidence that 
can be of considerable technical quality and rigor. 

From an assessment perspective, portfolios provide at least four potential 
“values-added” to more traditional means of generating evidence of learning: 

1. they are extensive over time and therefore reveal growth and 
development over time (however simply or subtly the growth may be 
defined); 

2. they allow for more sustained engagement and therefore permit the 
examination of sustained effort and deeper performance; 

3. to the extent that choice is involved in the selection of content (both 
teacher and most especially student choice), then portfolios reveal 
students’ understandings about and dispositions towards learning 
(including the unique special purposes that portfolios might address and 
their consequent selection guidelines); and 

4. they offer the opportunity for students to interact with and reflect upon 
their own work. 

It is important to note that for portfolios to realise their potential as 
evidentiary bases for instructional decision-making, particular attention 
must be given to some one (or all) of these four “values-added.” Not only 
should they serve as the focus for generating evidence uniquely beneficial to 
portfolios, but care must be taken in the construction and application of 
evaluative frameworks such that rigor and discipline attend the generation 
of data relevant to some one or all of these points. 

Allowing for a range of evidence encourages students to represent what 
they know in a variety of ways and gives teachers a way to fairly and more 
completely assess the learning. Collecting information over time provides a 
more comprehensive picture. For example, Elbow and Belanoff (1991) 
stated, “We cannot get a trustworthy picture of a student’s writing 
proficiency unless we look at several samples produced on several days in 
several modes or genres” (p. 5). Portfolios may contain writing samples, 
pictures, images, video or audiotapes, work samples - different formats of 
evidence that help an identified audience understand the student’s 
accomplishments as a learner. 

There are numerous ways students are involved in communicating 
evidence of learning as presented in portfolios. Some examples follow. Portfolios 
complement emerging reporting systems such as student, parent, teacher 
conferences (Davies et al., 1992; Davies, 2000; Gregory et al., 2001; 
Macdonald, 1982; Wong-Kam, Kimura, Sumida, Ahuna-Ka'ai, & Hayes 
Maeshiro, 2001). Sometimes students and parents meet at school or at home 
to review evidence of learning, often organised into portfolios, to show 
growth or learning over time (Davies et al., 1992; Howard & LeMahieu, 
1995). Other times portfolios are used in more formal conference settings or 
exhibitions where students present evidence of learning and answer 
questions from a panel of community members, parents, and peers (Stiggins 
& Davies, 1996; Stiggins, 1996, 2001). Exhibitions are part of the graduation 
requirements in schools belonging to the Coalition of Essential Schools 
(Sizer, 1996). Sometimes students meet with teachers to present their 
learning and the conversation is between teacher and student in relation to 
the course goals (Elbow, 1986). This format appears more appropriate for 
older high school students and for graduate and undergraduate courses. In a 
few instances, portfolios have been developed (including student choice in 
their assembly) for evaluation and public accounting of the performance of a 
program, a school, or a district (LeMahieu, Eresh, & Wallace, 1992b; 
LeMahieu, Gitomer, & Eresh, 1995a). This approach, when defined as an 
active process of inquiry on the part of a concerned community, transforms 
accountability from a passive enterprise in which the audience is “fed” 
summary judgements about performance to an active process of coming to 
inspect evidence and determine personal views about the adequacy of 
performance and (even more important) recommending how best to improve 
it (Earl & LeMahieu, 1997a; LeMahieu, 1996b). All of these ways of 
communicating have one thing in common - the student is either present or 
actively represented and involved in presenting a range of evidence of 
learning. The teacher assists by providing information regarding criteria and 
evidence of quality. Sometimes this is done through using a continuum of 
development that describes learning over time using samples from large- 
scale portfolio or work sample assessments. These samples provide a 
reference point for conversation about student development and 
achievement. Teachers use samples of work that represent levels of quality 
to show parents where the student is in relation to the expected standard. 
This helps respond to the question many parents ask, “How is my child 
doing compared to the other students?” These kinds of conferences involve 
parents and community members as participants in understanding the 
evidence and in “reporting” on the child’s strengths, areas needing 
improvement and the setting of goals. This kind of “verbal report card” 
involves students, parents, and the teacher in a face-to-face conversation 
supported with evidence. 



4. PORTFOLIOS AND THEIR QUALITIES 



4.1 The Reliability and Validity Issue 

When portfolios are used for large-scale assessment, concerns around 
their reliability and validity are expressed. For example, Benoit and Yang 
(1996), after using portfolios for assessment at the district level, recommend 
clear uniform content selection and judgement guidelines because of the 
need for greater inter-rater reliability and validity. Berryman and Russell 
(2001) indicate a similar concern for ensuring reliability and validity when 
they report that the Kentucky statewide rate for scoring the portfolios is 75% for 
exact agreement, while their school scoring has “86% exact agreement.” Resnick 
and Resnick (1993) reported that while teachers refined rubrics and received 
training, it was a challenge to obtain reliability between scorers. Inter-rater 
reliability of portfolio work samples continues to be a concern (e.g. Chan, 
2000; Fritz, 2001; Willis, 2000). Fritz (2001, p. 32) notes: “The evaluation and 
classification of results is not simply a matter of right and wrong answers, 
but of inter-rater reliability, of levels of skill and ability in a myriad of areas 
as evidenced by text quality and scored by different people, a difficult task at 
best.” Clear criteria and anchor papers assist the process. Experience seems 
to improve inter-rater reliability (Broad, 1994; Condon & Hamp-Lyons, 
1994; White, 1995; 1994b). 
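
To make concrete what an “exact agreement” rate of the kind quoted above refers to, the following minimal Python sketch computes the proportion of portfolios on which two raters assign identical scores, together with an adjacent-agreement rate (scores within one point of each other). The scores, the four-point scale and the function names are hypothetical illustrations; they are not taken from any of the studies cited.

```python
from typing import Sequence


def exact_agreement(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Proportion of portfolios given identical scores by both raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same, non-empty set.")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def adjacent_agreement(rater_a: Sequence[int], rater_b: Sequence[int],
                       tolerance: int = 1) -> float:
    """Proportion of portfolios whose two scores differ by at most `tolerance`."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same, non-empty set.")
    close = sum(1 for a, b in zip(rater_a, rater_b) if abs(a - b) <= tolerance)
    return close / len(rater_a)


# Hypothetical scores on a four-point rubric for eight portfolios.
scores_a = [3, 2, 4, 1, 3, 2, 4, 3]
scores_b = [3, 2, 3, 1, 3, 3, 4, 3]

exact = exact_agreement(scores_a, scores_b)        # 6 of 8 identical
adjacent = adjacent_agreement(scores_a, scores_b)  # all within one point
print(f"exact agreement: {exact:.0%}, adjacent agreement: {adjacent:.0%}")
```

Exact agreement is a deliberately strict index: two raters who differ by a single rubric level count as disagreeing, which is one reason reported rates depend on the number of score levels in the rubric.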

DeVoge (2000), whose dissertation examined the measurement quality of 
portfolios, notes that standardisation of product and process led to acceptable 
levels of inter-rater reliability. Concerns regarding portfolios being used for 
gathering information across classrooms within schools, districts, and 
provinces/states are expressed particularly in regard to large-scale portfolio 
assessment projects such as Kentucky’s and Vermont’s statewide portfolio 
programs and Pittsburgh Public Schools (LeMahieu, Gitomer, & Eresh, 
1995a). Some researchers express concerns regarding reliability (e.g. Calfee 
& Freedman, 1997; Callahan, 1995; Gearhart, Herman, Baker, & Whittaker, 
1993; Koretz, Stecher, & Deibert, 1993; Tierney et al., 1991) while others 
point out the absence of controls that would give confidence even as to 
whose work is represented in the collected portfolio (Baron, 1983). Novak, 
Norman and Gearhart (1996) note that the difficulties stem from “variations 
among the project portfolios models, models that differ in their 
specifications for contents, for rubrics, and for methods for applying the 
rubrics” (p. 6). Novak et al. (1996) examined techniques for assessing 
student writing. Raters were asked to score collections of elementary student 
narratives using holistic scales from two rubrics. Comparisons were based on 
three methods and results were mixed. One rubric gave good evidence of 
reliability and developmental validity. They sum up by noting that “if 
appropriate cut points are set then reasonably consistent decisions can be 
made regarding the mastery/non-mastery of the narrative writing 
competency of third grade students using any rubric-assessment 
combinations with one exception” (p. 30). 
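
The point about cut points can be illustrated with a small sketch: once scores are reduced to a mastery/non-mastery decision, two raters may reach the same decision even where their raw scores differ. The Python example below is only an illustration under assumed data; the scores, the cut point of 3 and the function name are hypothetical and do not reproduce the procedure of Novak et al. (1996).

```python
from typing import Sequence


def decision_consistency(scores_a: Sequence[int], scores_b: Sequence[int],
                         cut_point: int) -> float:
    """Proportion of students classified the same way (mastery or
    non-mastery) by both raters, where mastery means score >= cut_point."""
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("Both raters must score the same, non-empty set.")
    same = sum(
        1 for a, b in zip(scores_a, scores_b)
        if (a >= cut_point) == (b >= cut_point)
    )
    return same / len(scores_a)


# Hypothetical rubric scores from two raters for six students.
rater_a = [4, 3, 2, 5, 1, 3]
rater_b = [3, 3, 3, 4, 2, 2]

consistency = decision_consistency(rater_a, rater_b, cut_point=3)
print(f"decision consistency at cut point 3: {consistency:.2f}")
```

In this invented data set the raters give identical raw scores to only one of six students, yet they agree on the mastery decision for four of six, which shows why decision consistency can be acceptable where exact score agreement is low.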

Fritz (2001) names eight studies where researchers are seeing continued 
improvement in the quality of the data from portfolio assessments (p. 28). 
Fritz (2001) studied the level of involvement in the Vermont Mathematics 
Portfolio assessment in Grade 4 classrooms. In particular she was interested 
in whether involvement in the scoring process led to improved mathematics 
instruction. She explains that the student portfolio system requires a 
stratified random sample of mathematics that is centrally scored using 
rubrics developed in Vermont. The portfolio pieces are submitted on 
alternate years. In 1996, 87% of schools with Grade 4 students submitted 
mathematics portfolios. In 1996, teachers at 91 of 350 schools scored 
mathematics portfolios. She notes that the Vermont procedures have been 
closely examined with a view to improving the scoring (Koretz, Stecher, 
Klein, McCaffrey & Deibert, 1993). Current procedures are similar to those 
used in the New Standards Project (Resnick & Resnick, 1993). Individual 
teachers score student work. Up to 15% of the papers are double scored and 
those papers are compared to each other to check for consistency between 
scorers. 
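
A double-scoring check of this kind can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration, not the Vermont procedure itself: the function names, the fixed random seed, the rounding-down rule for “up to 15%” and the discrepancy threshold are all hypothetical choices made for the example.

```python
import random
from typing import Dict, List, Sequence


def select_for_double_scoring(paper_ids: Sequence[int], percent: int = 15,
                              seed: int = 0) -> List[int]:
    """Randomly choose at most `percent` per cent of papers for a second
    scoring. Integer arithmetic rounds down, so the share never exceeds
    the stated percentage."""
    sample_size = len(paper_ids) * percent // 100
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    return sorted(rng.sample(list(paper_ids), sample_size))


def flag_discrepancies(first_scores: Dict[int, int],
                       second_scores: Dict[int, int],
                       max_gap: int = 0) -> List[int]:
    """Return ids of double-scored papers whose two scores differ by more
    than `max_gap` rubric points."""
    return [
        paper_id for paper_id, second in sorted(second_scores.items())
        if abs(first_scores[paper_id] - second) > max_gap
    ]


# Hypothetical batch of 40 papers, of which at most 15% are re-scored.
papers = list(range(1, 41))
to_rescore = select_for_double_scoring(papers, percent=15, seed=1)

# Hypothetical first and second scores for three double-scored papers.
first = {1: 3, 2: 2, 3: 4}
second = {1: 3, 3: 2}
print(len(to_rescore), flag_discrepancies(first, second))
```

Papers flagged in this way would then be candidates for discussion or adjudication between scorers.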

Researchers reporting good levels of reliability in scoring performance 
assessments include Arter, Spandel, and Culham (1995); Gearhart, Herman, and 
Novak (1994); and LeMahieu, Gitomer, and Eresh (1995). Herman (1996) 
summarises the issues relating to validity and reliability. She explains that 
while challenging, “assuring the reliability of scoring is an area of relative 
technical strength in performance assessment” (p. 13). Raters can be trained 
to score open-ended responses consistently. For example, Herman (1996) 
reports that the Iowa Tests of Basic Skills direct writing assessment demonstrates 
that it is possible to achieve high levels of agreement with highly trained 
professional raters and tightly controlled scoring conditions (Herman cites 
Hoover & Bray, 1995). She goes on to note that portfolio collections are 
more complex and this multiplies the difficulty of ensuring reliability. 
LeMahieu, Gitomer and Eresh report reliabilities ranging from .75 to .87 and 
inter-rater agreement rates ranging from 87% to 98% for a portfolio system 
developed in a large urban school district. They go on to document the steps 
taken to ensure these levels of reliability (LeMahieu, Gitomer, & Eresh, 
1995a, 1995b). These included involving teachers in the inductive process 
that developed and refined the assessment frameworks (including rubrics) 
and drawing upon such development partners as model scorers; extensive 
training for all scorers (development partners as well as new scorers) that 
includes observation of critical reviews of student work by model scorers, 
training to an acceptable level of criterion performance for all scorers, using 
benchmark portfolios that are carefully selected as part of the development 
process to illustrate both the nature of performance at various levels as well 
as some of the more common issues in the appraisal of student work; 
constant accommodation processes during the scoring with adjudication of 
discrepant scores as needed. 

Despite the positive research results concerning inter-rater reliability, 
Darling-Hammond (1997), after reviewing information regarding portfolio 
and work sampling large-scale assessment systems, questioned whether they 
resulted in improvements in teaching and learning as well as whether or not 
they were able to measure quality of schooling. In this sense, Darling- 
Hammond, in line with the expanded view on validity in edumetrics, asks for 
more evidence for the consequential validity of portfolio assessment. To 
what extent is there evidence that portfolio assessment leads to the 
theoretically assumed benefits for learning? 

4.2 Do Portfolios Lead to Better Learning and 
Teaching? 

Student portfolios are usually promoted as a powerful instrument for 
formative assessment or for assessment for learning (e.g. Elbow & Belanoff, 
1986; Julius, 2000; Tierney et al., 1991). Portfolios are viewed as having the 
potential to allow learners (of all ages and kinds) to show the breadth and 
depth of their learning (e.g. Berryman & Russell, 2001; Costa & Kallick, 
1995; Davies, 2000; Flood & Lapp, 1989; Hansen, 1992; Howard & 
LeMahieu, 1995; Walters, Seidel & Gardner, 1994; Wolf, Bixby, Glenn, & 
Gardner, 1991). Involving students in every part of the portfolio process is 
critical to its success as a learning and assessment tool. Choice and 
ownership, opportunities to select evidence and reflect on what it illustrates 
while preparing evidence for an audience whose opinion they care about are 
key aspects of portfolio use in classrooms. Giving students choices about 
what to focus on next in their learning, opportunities to consider how to 
provide evidence of their learning (to show what they know), and to reflect 
on and record the learning the evidence represents makes it more possible to 
learn successfully. Research examining the impact of the use of portfolios 
on students’ learning focuses on students’ motivation, ownership and 
responsibility, feedback, and self-reflection. 

4.3 Portfolios: Inviting Choice, Ownership and 
Responsibility 

When learners are engaged, they are more likely to learn. Researchers 
studying the role of emotions and the brain say that learning experiences 
such as these prepare learners to take the risks necessary for learning 
(Goleman, 1995; Jensen, 1998; LeDoux, 1996). Portfolios impact positively 
on learning in terms of increased student motivation, ownership, and 
responsibility (e.g. Elbow & Belanoff, 1991; Howard & LeMahieu, 1995; 
Paulson, Paulson, & Meyer, 1991). For example, Howard and LeMahieu 
(1995) report that when students in a classroom environment kept a writing 
portfolio during the school year and shared that portfolio with parents, the 
students’ commitment to writing increased and their writing improved. 
Researchers studying the role of motivation and confidence on learning and 
assessment agree that student choice is key to ensuring high levels of 
motivation (Covington, 1998; Stiggins, 1996). When students make choices 
about their learning, motivation and achievement increase; when choice is 
absent, they decrease (DeCharms, 1968; 1972; Deci & Ryan, 1985; Jensen, 
1998; Lepper & Greene, 1975; Maehr, 1974; Mager & McCann, 1963; 
Mahoney, 1974; Purkey & Novak, 1984; Tanner, 2000; Tjosvold, 1977; 
Tjosvold & Santamaria, 1977). Researchers studying portfolios found that 
when students choose work samples the result is a deeper understanding of 
content, a clearer focus, better understanding of quality product, and an 
ownership towards the work that “... created a caring and an effort not 
present in other learning processes” (Gearhart & Wolf, 1995, p. 69). 
Gearhart and Wolf (1995) visited classrooms at each of four school sites just 
before or just after students made their choices for their portfolios. They 
talked extensively with teachers and students, and collected copies of 
portfolios. Their project was designed to clarify questions about issues in the 
implementation of a portfolio assessment program. They noted that 
students’ choices influenced the focus of personal study, ongoing 
discussions and the work of the classroom. The increased learning within 
those classes seemed to be related to students’ active engagement through 
choice. They also report a change in the student/instructor relationship, 
which became more focused, less judgmental and more 
productive. They note that a balance is needed between external criteria used 
by the teacher and the internal criteria of the students and conclude by 
encouraging an on-going dialogue concerning assessment and curriculum 
amongst students, teachers, and assessment experts. 

Tanner (2000) examined the experience with writing portfolios in general 
education courses at Washington State University. Specifically, he examined 
issues such as the history of the portfolio efforts, experience in light of 
research, and impact on students. Since 1986 students have been required to submit a 
portfolio that includes three previously produced papers as well as a timed 
written exam. Later in their studies there is a requirement for a senior 
portfolio to be determined by the individual disciplines and to be evaluated 
by faculty from those same disciplines. Tanner notes that the literature 
references learning through portfolio use in terms of, “student attitude, 
choice, ownership, performance based learning, growth in tacit knowledge, 
and the idea of a climate of writing” (p. 83). In his conclusions he affirms 
that these same elements are present as a result of the portfolio work at 
Washington State University. Tanner (2000) writes, “... such personal 
investment, and ownership are the first steps in dialectic participation where 
ideas and knowledge are owned and remembered, a classic definition of 
learning” (p. 59), and adds, “K-12 research shows connections between 
learning and such elements as choice and personal ownership of work, 
elements fostered by portfolio requirements. The connections between 
learning and broad-based portfolio assessment were clearly observed” (p. 79). 

Portfolios enrich conversations about learning. Portfolios have different 
looks depending on purpose and audience. The audiences for classroom- 
based portfolios include the students themselves, their teacher(s), parents, 
and sometimes community members or future employers. This enhances the 
credibility of the process. Portfolios encourage students to show what they 
know and provide a supportive framework within which learning can be 
documented. Using portfolios in classrooms as part of the organising and 
collecting of evidence prepares students to present their learning and to 
engage in conversation (sometimes in writing, sometimes orally or through 
presentations) about their learning. Julius (2000) asserts that knowing they 
will be showing portfolios to someone whose opinion they care about 
engenders “accountability and a sense of responsibility for what was in the 
portfolios” (p. 132). Willis (2000) notes, “This formal conferencing process 
challenges students to be more accountable to an authentic audience outside 
of their classroom and generally improves the quality...” (p. 47). 

When individual student portfolios are scored and graded, the power of 
student choice, ownership, and responsibility may be diminished. Willis 
(2000) states that rewards and sanctions are “... antithetical to learner centred 
goals of true portfolio culture” (p. 39; for a discussion of the impact of 
rewards see Kohn, 2000). Willis (2000) refers to student, teacher and 
institutional learning after examining how Kentucky’s Grade 12 writing 
portfolios have influenced seniors’ writing instruction and experiences, 
affected students’ skills and attitudes about writing, and influenced 
graduates’ transition to college writing. He collected data by administering 
exploratory surveys to 340 seniors, interviewing 10 students who had graduated 
and continued their education at the college level, and conducting a 
document analysis of writing portfolios produced in senior English classes as 
well as samples of writing submitted in college composition courses. Willis 
notes the self-assessments demonstrated little awareness of the standards in 
effect in the learning environment. Willis (2000) reports that a statistical 
analysis of 340 students showed that students disregarded the worth of the 
portfolio process to the same extent they had been disappointed with the 
scores received. As a result, Willis (2000) recommends that students have 
more experience scoring their own work. Thome (2001) studied the impact 
of using writing criteria on student learning and found that when students 
were aware of the criteria for success their writing improved. Similarly, 
Young (2001) found that rubrics motivate learners, encourage them to 
improve, and provide a means for giving specific feedback. 

4.4 Portfolios: Feedback that Supports Learning 

When portfolios are accompanied by criteria that are written in language 
students can understand, describe growth over time, and indicate what 
is required to achieve success, they can be used by students to guide their 
learning with on-going feedback as they create their portfolios. There is a 
vast amount of research concerning the impact of feedback on students’ 
learning. There is evidence that specific feedback is essential for learning 
(Black & Wiliam, 1998; Caine & Caine, 1991; 1999; Carr & Kemmis, 1986; 
Crooks, 1988; Dewey, 1933; Elbow, 1986; Hattie, in press; Sadler, 1989b; 
Senge, 1990; Shepard, 2000; Stiggins, 1996; Sylwester, 1995). Sutton (1997) 
and Gipps & Stobart (1993) distinguish between descriptive and evaluative 
feedback. Descriptive feedback serves three goals: 1) it describes strengths 
upon which further growth and development can be established; 2) it 
articulates the manner in which performance falls short of desired criteria 
with an eye to suggesting how that can be remediated; and 3) it gives 
information that enables the learner to adjust what he or she is doing in order 
to get better. Specific descriptive feedback that focuses on what was done 
successfully and points the way to improvement has a positive effect on 
learning (Black & Wiliam, 1998; Butler, 1987, 1988; Butler & Nisan, 1986; 
Butterworth & Michael, 1975; Fuchs & Fuchs, 1985; Kohn, 1993). 
Descriptive feedback comes from many sources. It may be specific 
comments about the work, information such as posted criteria that describe 
quality, or models and exemplars that show what quality looks like and the 
many ways in which it can be expressed. Evaluative feedback, particularly 
summary feedback, is very different. It tells the learner how she or he has 
performed as compared to others or to some standard. Evaluative feedback is 
highly reduced, often communicated using letters, numbers, checks, or other 
symbols. It is encoded, and is decidedly not “rich” or “thick” in the ways 
suggested for descriptive feedback above. This creates problems for 
students, particularly for students who are 
struggling. Beyond the obvious disappointment that summary 
feedback cannot address students’ needs or show how further growth 
and development can be realised, there are also problems that affect 
students’ motivation to engage in learning. Students with poor marks are 
more likely to see themselves as failures. Students who see themselves as 
failures may be less motivated and therefore less likely to succeed as 
learners (Black & Wiliam, 1998; Butler, 1988; Kamii, 1984; Kohn, 1993; 
Seagoe, 1970; Shepard & Smith, 1986a, 1987; Schunk, 1996). Involving 
students in assessment increases the amount of specific descriptive feedback 
available to learners while they are learning. Limiting specific feedback 
limits learning (e.g. Black & Wiliam, 1998; Hattie, in press; Jensen, 1998; 
Sadler, 1989b). 

Joslin (2002) studied the impact of criteria and rubrics on the learning of 
students in fifth and sixth grade (approximately 9-12 years of age). He 
found that when students use criteria in the form of a rubric that describes 
development towards success, students are better able to identify strengths 
and areas needing improvement. Joslin (2002) found that using criteria and 
rubrics affects students’ desire to learn in a positive way and expands their 
ability to assess and monitor their own learning. He notes that when scores 
alone were used, students who did not do well also did not know how to 
improve performance in the future. When students and teachers used the 
rubrics that described success they were able to talk about what they had 
done well and what they needed to work on next. Joslin (2002) states, 
“Students from the treatment group who received the rubric were aware of 
how well they would do on an assignment before being marked. The reason 
for their understanding was based on comments indicating they could check 
out descriptors they had completed. They were also able to identify what was 
needed to complete the task appropriately, indicating an awareness of 
self-monitoring and evaluation. In the comparison group students’ comments 
reveal a lack of understanding of how they were evaluated. Students also 
indicated they would try harder to improve their grade next time but were 
unaware of what they needed to do to improve.” (p. 41). He concludes by 
writing, “This research study has indicated a positive relationship between 
the use of a rubric and students desire to learn.” (p. 42). When students have 
clear criteria, feedback can be more descriptive and portfolios can better 
support learning. 

4.5 Portfolios and Self-Reflection 

Meta-cognitive skills are supported and practised during portfolio 
development as students reflect on their learning and select work samples, 
put work samples in the portfolio, and prepare self-assessments that explain 
the significance of each piece of work. Portfolio construction involves skills 
such as awareness of audience, awareness of personal learning needs, 
understanding of criteria of quality and the manner in which quality is 
revealed in their work and compilations of it, as well as development of skills 
necessary to complete a task (e.g. Duffy, Jones & Thomas, 1999; Mills- 
Court & Amiran, 1991; Yancey, 1997). 

Students use portfolios to monitor progress and to make judgements 
about their own learning (Julius, 2000). Julius (2000) examined elementary 
students’ perceptions of portfolios by collecting data from 22 students and 
their teachers from two third grade classrooms. Data collection included 
student and teacher interviews, observation of student-teacher conferences, 
portfolio artefacts, teacher logs and consultations with teachers. Portfolios 
were found to contribute to students’ ability to reflect upon their work and to 
the development of students’ sense of ownership in the classroom. Julius 
(2000) reports, “Results of this study indicated that students used portfolios 
to monitor their progress, students made judgements based on physical 
features, choice was a factor in the portfolio process, and, instructional 
strategies supported higher order thinking.” (p. vii) As students become 
more used to using the language of assessment in their classroom as they set 
criteria, self-assess and give peers descriptive feedback, they become better 
able to use that feedback to explain the significance of different pieces of 
evidence and later to explain their learning to parents and others. 

One key aspect of classroom portfolios is students’ selecting evidence 
from multiple sources and explaining why each piece of evidence needs to 
be present - what it shows in terms of student learning and the manner in 
which it addresses the audience and the purpose of the portfolio. Portfolios 
communicate more effectively when the viewer knows why particular 
evidence has been included. Students who are involved in classroom 
assessment activities use the language of 
assessment as they develop criteria and describe the levels of quality on the 
way. Practice using the language of assessment prepares students to reflect. 
Their self-assessments become more detailed and better able to explain what 
different pieces of evidence show. Initially, work is selected for 
reasons such as “It was the longest thing I wrote” or “It got the best grade.” 
Over time, notions of quality become more sophisticated, and students cite 
specific criteria in use in the classroom and the manner in which evidence in 
the portfolio addresses those criteria. The capacity to do so is essential to high performance 
learning. Bintz and Harste (1991) explain, “Personal reflection required in 
portfolio evaluation increases students' understanding of the processes and 
products of learning...” 

4.6 Portfolios: Teachers as Learners 

Just as students learn by being involved in the portfolio process, so do 
teachers. There are five key ways teachers learn through portfolio use: 

1. Teachers learn about their students as individuals by looking at their 
learning represented in the portfolios. 

2. Teachers learn about what evidence of learning can look like over time 
by looking at samples of student work. 

3. Teachers form interpretative communities that most often have higher 
standards and more consistently applied standards (both from student to 
student and from teacher to teacher) for student work than was the case 
before entering into the development of portfolio systems. 

4. Teachers challenge and enrich their practice by addressing the higher 
expectations of student learning with classroom activities that more 
effectively address that learning and 

5. Teachers learn by keeping portfolios themselves to show evidence of 
their own learning over time. 

Tanner (2000) says that while there was some direct student learning 
from portfolio assessment, perhaps the “greater learning came from post- 
assessment teachers who created a better climate for writing and learning” 
(p. 63). Teachers who knew more about learning returned to classrooms 
prepared to “impact subsequent student cohorts” (Tanner, 2000; p. 71). 
Tanner (2000) describes the learning - for students as well as their 
instructors - that emerges as students are involved in a school-wide portfolio 
process. Based on interviews with key informants he describes the changes 
that have occurred since 1986 that indicate positive changes in student 
attitude, choice, ownership, engagement as well as changes in teachers’ 
understanding and knowledge, and changes throughout the college. Herman, 
Gearhart and Aschbacher (1996) also report that portfolio use results in 
learning by improving teaching. The example given is Aschbacher’s (1993) 
action research, which showed two thirds of teachers reporting substantial 
change in the way they thought about their teaching, two thirds reporting an 
increase in their expectations for students, and a majority finding that 
alternative assessments such as portfolios reinforced the purpose or learning 
goals. 

There is increasing attention being paid to the process of examining 
student work samples as part of teachers’ professional learning and 
development (e.g. Blythe et al., 1999; Richards, 2001; MAPP, 2002). This is 
a natural outgrowth of: 

• conversations amongst teachers (e.g. www.lasw.org; Blythe et al., 1999), 

• school improvement planning processes (e.g. B.C. School Accreditation 
Guide, 1990; Hawaii’s Standards Implementation Design Process, 2001), 

• large-scale assessments (e.g. B.C. Ministry of Education, 1993; Fritz, 
2001; Willis, 2000), and 

• school-level work with parents to help them understand growth over 
time (Busick, 2001; Cameron, 1991). 

There are multiple reasons teachers examine student work samples by 
themselves or with colleagues as part of their own professional learning: 

1. Understanding individual students’ growth and development to inform 
students, teachers, and parents about learning. 

2. Increasing expectations of students (as well as the system that serves 
them) through encounters with student work that reveal capacities greater 
than previously believed. 

3. Making expectations for student performance more consistent, both 
across teachers and across students. 

4. Understanding next teaching steps by examining student work with 
colleagues analysing strengths, areas needing improvement and next 
teaching steps. 

5. Learning how to evaluate work in relation to unfamiliar standards fairly 
by comparing samples from students within a school. 

6. Gaining a better understanding of development over time by looking at 
samples of student work and comparing them to published developmental 
continuums. 

7. Developing a common understanding of standards of quality by looking 
at samples of student work in relation to standards. 

8. Learning to use rubrics from large-scale or district assessments to analyse 
work samples. 

9. Considering student achievement over time within a school or across 
schools in a district. 

10. Informing the public of the achievement levels of groups of students. 

Teachers may examine student work from individuals, groups of 
students, multiple classes of students, or from different grade levels in 
different subject areas. Blythe et al. (1999) describe different ways to 
involve teachers in examining student work. Parents are also beginning to be 
invited to join the process (e.g. BC Min of Ed. School Accreditation Process, 
1990). 

4.7 Portfolios: Parents and Other Community Members 
as Learners 

Portfolios can inform different stakeholders of ongoing growth and 
development (e.g. Danielson, 1996; Klenowski, 2000; McLaughlin & Vogt, 
1996; Shulman, 1998; Wolf, 1996; Zeichner & Liston, 1996). Efforts to 
include parents and others as assessors of student learning have revealed a 
further power of portfolios. Not only do parents come to a fuller 
understanding of their children’s learning, they better appreciate the goals 
and instructional approaches of the learner and the teacher(s). This in turn 
makes them more effective partners in their children’s learning and ensures 
their support for teachers’ efforts at innovation and change (Howard & 
LeMahieu, 1995; Joslin, 2002b). Conversations with parents, teachers, and 
children with portfolios as an evidentiary basis provide a more complete 
picture of children’s growth and understanding than standardised test scores. 
They also provide ideas so parents can better support their children’s 
learning in and out of school, so teachers can better support the learner in 
school, and so the learners can support themselves as they learn. Further, 
portfolios, and the conversations that take place in relation to them, can 
promote the involvement of all the members of the learning community in 
educating children (e.g. Gregory et al., 2001; Lu & Lamme, 2002). 

4.8 Portfolios: Schools and Systems Learn 

Schools learn (e.g. Costa & Kallick, 1995; Fullan, 1999; Schlechty, 1990; 
Schmoker, 1996; Senge, 2000; Sutton, 1997) and need assessment 
information in order to continue to learn and improve. Portfolios and other 
work sample systems can help schools both learn and show their learning. 
Systems learn (e.g. Senge, 1990) and need reflective feedback to help them 
continue to improve. They need assessment information. Portfolios can be a 
part of the evidence collected both to support and to substantiate the learning 
and are increasingly used for assessment of students, teachers, schools, 
districts, and educational jurisdictions such as provinces or states 
(Darling-Hammond, 1997; Fritz, 2001; Gillespie et al., 1996; Millman, 1997; Ryan & 
Miyasaka, 1995). When it comes to systems learning, numerous researchers 
have made recommendations based on their experiences with portfolios for 
district and cross-district assessments. Reckase (1995), for example, 
recommends a collection that represents all the tasks students are learning 
including both final and rough drafts. Fritz (2001) reviewed the Vermont 
large-scale portfolio assessment program and notes that since it began in 
1989 there have been numerous studies and recommendations leading 
towards ongoing improvements in the design and scoring as well as the way 
data is used to improve the performance of schools. Portfolios are collected 
from all students, scored at the school level and then a sampling of portfolios 
is also scored at the state level. Kentucky has had a portfolio component in 
its assessment system since 1992 when the Kentucky Educational Reform 
Act became law (see Redfield & Pankratz, 1997 or Willis, 2000 for a 
historical overview). Like Vermont, the Kentucky mandated performance 
assessment has evolved over time with changes being made to the portfolio 
contents as well as the number of work samples required. Over time, some 
multiple choice test items and on-demand writing have been included 
(Lewis, 2001). The state of Maine has an on-going portfolio project that is 
developing tasks and scoring rubrics for use in classrooms as part of the 
Comprehensive Local Assessment System districts need to have in place by 
2007. The New Standards Project co-ordinated and reported by Resnick and 
Resnick (1993) looked at samples of student work in Mathematics and 
English/Language Arts from almost 10,000 grade-4 students. 



5. DISCUSSION 

Barth (2000) has made the point that in 1950 students graduated from 
high school knowing 75% of what they needed to know to succeed. In 2000, 
students graduated with 2% of what they needed to know because 98% of 
what they needed to know to be successful was not yet known. This fact 
alone fundamentally changes the point of schooling. Today a quality high 
school education that provides these new basic skills is a minimum. Even 
more than this, a quality high school education must equip the learner to 
continuously grow, develop and learn throughout his or her lifetime. 
Post-secondary education can strive to do no less. For example, the Globe and 
Mail, a national Canadian newspaper, noted “employers’ relentless drive to 
avoid the ill-educated” (March 1, 1999). It went on to note that jobs for 
people with no high school diploma fell 27%. In 1990 employees with 
post-secondary credentials held 41% of all jobs, while in 1999 that share had 
risen to 52% of all jobs. The trend is expected to continue and the gap to widen. 
Government commissions, business surveys, newspaper headlines and the 
help wanted advertisements all testify to the current reality - wanted: 
lifelong learners who have new skills basic to this knowledge age - readers, 
writers, thinkers, technologically literate, and able to work with others 
collaboratively to achieve success. We can’t prepare students to be lifelong 
learners without changing classroom assessment. Broadfoot (1998) puts it 
this way, “the persistence of approaches to assessment which were 
conceived and implemented in response to the social and educational needs 
of a very different era, effectively prevents any real progress” (p. 453). 
Traditional forms of assessment were not conceived without logic; they were 
conceived to respond to an old, now outdated, logic. 

Meaningful assessment reform will occur when 

• students are deeply involved in the assessment process; 

• evidence of learning is defined broadly enough for all learners to show 
what they know; 

• classroom assessment is understood to be different than other kinds of 
assessment; 

• an adequate investment in assessment for learning is made; and, 

• a proper balance is achieved between types of assessment. 

Accountability for individual student learning involves looking at the 
evidence with learners, making sense of it in terms of student strengths and 
areas needing improvement, and helping students learn ways to self-monitor 
their way to success. Classroom assessment will achieve its primary purpose 
of supporting student learning when it is successfully used to help students 
learn more and learn ways to lead themselves to higher levels of learning. 
Portfolio assessment can play a significant role. 

In our experience, two things have invariably been realised through the 
“assessment conversations” that are entered into by teachers in the 
development of portfolio systems. Both of these outcomes greatly enhance 
the intellectual and human capital of the systems and contribute to the 
potential for their improved performance. First, all who participate in the 
development of portfolio systems leave with higher and more clearly 
articulated aspirations for student performance. This should not be 
surprising, as the derivation of criteria and expectations for quality in 
performances is essentially additive. One professional sees certain things in 
the student work while the next recognises some of these (but perhaps not 
all) and adds some more. These assessment conversations proceed until the 
final set of aspirations (criteria of quality) is far greater than the initial one or 
that of any one member of the system at the outset. The second effect of 
these assessment conversations is that a shared interpretative framework for 
regarding student work emerges. The aspirations and expectations become 
commonly understood across professionals and more consistently applied 
across students. Again, the nature of these conversations (long term shared 
encounters and reflections) intuitively supports this outcome. 

These two outcomes of assessment conversations (elevated aspirations 
and more consistently held and applied aspirations) are key ingredients in a 
recipe for beneficial change. Educational research is nowhere more 
compelling than in its documentation of the relationship between 
expectations and student accomplishment. Where expectations are high and 
represent attainable yet demanding goals, students strive to respond and 
ultimately achieve them. These assessment conversations, focused upon 
student work produced in response to meaningful tasks, provide powerful 
evidence that warrants the investment in the human side of the educational 
system. 

It is for these reasons that we are optimistic about the place of portfolios 
in reform in North America. Yet, that said, portfolios are not mechanical 
agents of change. We do not accept the logic that says that testing 
(however new or enlightened) coupled with the North American version of 
accountability will motivate increased performance. In fact, we find it a 
cynical argument, presuming as it does that all professionals in the system 
could perform better but for reasons (that will be eliminated by the proper 
application of rewards and sanctions) they have simply chosen not to. 
However, our experience also suggests that in order for the full potential of 
assessment development or teacher and student engagement in rich and 
rewarding assessment tasks to be realised, it must be approached in a manner 
consistent with the understandings developed here. 

Portfolios pose unique challenges in large-scale assessment. Involving 
students in every part of the portfolio process is critical to its success as a 
learning and assessment tool. Choice and ownership, thinking about their 
thinking, and preparing evidence for an audience whose opinion they care 
about are key aspects of portfolio use in classrooms. These critical features 
risk being lost when the portfolio contents and selection procedures are 
dictated from outside the classroom for accountability purposes. Without 
choice and student ownership, portfolios may be limited in their ability to 
demonstrate student learning. This may mean that large-scale portfolio 
assessment may become a barrier to individual student learning. However, 
using portfolios for large-scale assessment (when done well) can potentially 
support system learning in at least these ways: 

• Facilitating a better understanding of learning and achievement trends 
and patterns over time 

• Informing educators about learning and assessment as they analyse 
resulting student work samples 

• Enhancing professionals’ expectations for students (and themselves as 
facilitators of student learning) as a result of working with learners’ 
portfolios 

• Making it possible to assess valued outcomes that are well beyond the 
reach of other means of assessment 

• Informing educators’ understandings of what learning looks like over 
time as they review collections of student work samples 

• Helping students to understand quality as they examine collections of 
student samples to better understand the learning and what quality can 
look like 

• Assisting educators and others to identify effective learning strategies 
and programs 

Purpose is key. Whose learning is intended to be supported? Student? 
Teacher? School? System? 

Assessments without a clear purpose risk muddled methods, procedures, 
data, and findings (e.g. Chan, 2000; Paulson, Paulson, & Meyer, 1991; 
Stiggins, 2001). For example, one group indicated that the jurisdiction could 
use portfolios to assess individual student achievement, teaching, educators, 
schools, and provide state level achievement information (see for example, 
Richard, 2001). This is true, but different portfolios would be required; 
otherwise the purpose could be confused, the methods inappropriately used, the 
procedures incorrect, the resulting portfolios likely inappropriate to the 
stated purposes and the findings inaccurate. When the purpose and audience shift, the 
portfolio design, content, procedures, and feedback need to be realigned. If 
the purpose for collecting evidence of learning and using portfolios is to 
support student learning then it may not be necessary for portfolios to be 
evaluated, scored or graded. If the purpose for collecting evidence of 
learning and using portfolios is to support educators (and others) as they 
learn and seek to improve system performance then portfolios will be 
standardised to some necessary degree, evaluated and scoring results made 
available to educators and others. 

1. INTRODUCTION

Educational innovations in instruction and assessment have been 
overwhelming during the last decade: new teaching methods and strategies
are introduced in teacher and higher education and teaching practice, the 
latest technologies and media are used, and new types and procedures of 
assessment are developed and implemented. Most of these innovations are 
inspired by constructivist learning theories, in which the learner is an active 
partner in the process of learning, teaching and assessment. This belief in the 
active role of the student in instruction and assessment and the finding of 
Entwistle (1991) that it is students’ perceptions of the learning environment
that influence how a student learns, not necessarily the context in itself, both 
gave rise to this review study. Reality per se is often not sufficient to fully 
understand student learning and accompanying assessment processes. 
“Reality as experienced by the student” has in this respect an important 
additional value. It is this second-order perspective (Van Rossum & Schenk, 
1984), that is the primary concern of this review on new modes of 
assessment. Our purpose is to review the research and literature on
students’ perceptions about assessment, with the aim of achieving a better
understanding of students’ perceptions about assessment in higher education
and of gaining insight into the potential impact of these perceptions on student
learning and, more broadly, the learning-teaching environment. The following
questions were of special interest to this review: (1) what are the influences
of the (perceived) characteristics of assessment on students’ approaches to
learning, and vice versa, (2) what are students’ perceptions about different 
novel assessment formats and methods, and (3) what are the influences of 
these students’ perceptions about assessment on student learning? 



2. METHODOLOGY FOR THE REVIEW 

The Educational Resources Information Center (ERIC), the Web of
Science, PsycINFO and Current Contents were searched online for the
years 1980 until now. The keywords “student* perception*” and
“assessment” were combined. This search yielded 508 hits in the databases
of ERIC and PsycINFO and 37 hits within the Web of Science. When this
search was limited with the additional keyword “higher education”, only 171
hits and 10 hits respectively remained. Relevant documents were selected
and searched for in the libraries and the e-library of the K.U. Leuven. For
the purpose of this review on students’ perceptions about assessment in
higher education, 35 documents met our criteria. Within this selection of
literature, 36 empirical studies are discussed. For a summary of this
literature, we refer to the overview presented in Table 1. Theoretical and
empirical articles are both included. Using other literature reviews as a guide 
(Topping, 1998; Dochy, Segers, Gijbels & Van den Bossche, 2002), we 
defined the characteristics central to this review and analysed the empirical 
articles according to these characteristics. First, a specific code is given to
each article, for example: 1996/03/EA. This code refers to the publication
year/number/publication type (EA: empirical article; TA: theoretical article;
CB: chapter of book). Second, the author(s) and title of the publication are
presented. Further, the overview reports on the following characteristics of
the reviewed research: (1) the content of the study, (2) the type and method 
of the investigation, (3) the subjects and the type of education in which the 
study is conducted, (4) the number of subjects in the experimental and 
control group, (5) the most important results that were found, (6) the 
independent and (8) dependent variables studied, (7) the treatment which 
was used, and (9) the type and (10) method of analyses reported in the 
research. Both qualitative and quantitative investigations are discussed. 
Because of the large diversity of research that was found on this particular 
topic (e.g. exploratory studies, experiments, surveys, case studies, 
longitudinal studies, cross-sectional investigations, qualitative interpretative
research and quantitative research methods), a narrative review is conducted 
here. In a narrative review, the author tries to make sense of the literature in 
a systematic, creative and descriptive way (Van IJsendoorn, 1997). To 
prevent bias, because of the more intuitive nature of the narrative review, we 
reported on the procedure and criteria used to locate the studies and we
described the methodological issues of the research as completely and 
objectively as possible. 
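
The coding scheme described above (publication year/number/publication type) can be illustrated with a minimal sketch. The names `ArticleCode` and `parse_article_code` are hypothetical and serve only as an illustration; they are not part of the review procedure itself, which reports only the code format and the three type abbreviations.

```python
from dataclasses import dataclass

# Publication-type abbreviations as listed in the text.
PUBLICATION_TYPES = {
    "EA": "empirical article",
    "TA": "theoretical article",
    "CB": "chapter of book",
}


@dataclass(frozen=True)
class ArticleCode:
    """Hypothetical container for one review code such as '1996/03/EA'."""
    year: int
    number: int
    publication_type: str


def parse_article_code(code: str) -> ArticleCode:
    """Split a code of the form year/number/type into its three parts."""
    parts = [part.strip() for part in code.split("/")]
    if len(parts) != 3:
        raise ValueError(f"expected year/number/type, got {code!r}")
    year_text, number_text, type_abbreviation = parts
    if type_abbreviation not in PUBLICATION_TYPES:
        raise ValueError(f"unknown publication type {type_abbreviation!r}")
    return ArticleCode(
        year=int(year_text),
        number=int(number_text),
        publication_type=PUBLICATION_TYPES[type_abbreviation],
    )
```

Under these assumptions, the example code 1996/03/EA would be read as article number 3 of 1996, an empirical article.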

3. STUDENTS’ PERCEPTIONS ABOUT ASSESSMENT

The repertoire of assessment methods in use in higher education has 
expanded considerably in recent years. New assessment methods are 
developed and implemented in higher education, for example: self, peer, and 
co-assessment, portfolio assessment, performance assessment, simulations, 
formative assessment and OverAll assessment. The notion of “alternative 
assessment” is often used to denote forms of assessment which differ from 
the conventional assessment methods, such as multiple-choice testing and
essay question exams and continuous assessment through essays and 
scientific reports (Sambell, McDowell & Brown, 1997). New constructivist 
theories and practices go together with a shift from a “test” culture to an 
“assessment” culture. The assessment culture, embodied in current uses of
alternative assessment, favors: the integration of assessment, teaching and
learning; the involvement of students as active and informed participants;
assessment tasks which are authentic, meaningful and engaging; assessments
which mirror realistic contexts; a focus on both the process and products of
learning; and a move away from single test scores towards a descriptive
assessment based on a range of abilities and outcomes (Sambell et al., 1997).

In this part of the review, the literature and research on students’ 
perceptions about assessment are reviewed. The impact of (perceived) 
characteristics about assessment on students’ approaches to learning and vice 
versa, is examined and discussed. This way, an attempt is made to answer 
our first question of special interest to this review. Next, students’ 
perceptions about different, new modes of assessment are presented, 
including: portfolio assessment; self- and peer assessment; overall 
assessment; simulations; and finally, more general perceptions of students 
about assessment are investigated. This analysis aims to gain insight
into our second review question: “What are students’ perceptions about
different alternative assessment formats and methods?” Finally, the effects
of students’ perceptions about assessment on student learning are reviewed, 
and therefore an answer to our third and last question is provided. 

It should be noted that there are marked differences in what
“perception” means in the operational sense across studies. Some
authors define perceptions as the opinions (e.g. do you think that cheating is
ethically justifiable?) that students have concerning learning and studying,
cheating and plagiarism, etc. Also, students’ attitudes (e.g. do you find this
assessment format difficult?) towards and preferences (e.g. do you prefer
a multiple-choice test to an essay exam?) for different formats of assessment
are included in the concept of “perception”. Yet other researchers try to
capture students’ experiences (e.g. how did you handle this task?) with one
or several assessment formats under the word “perception”.

These differences are pointed out in the text and should be taken into
consideration when interpreting the investigations, their results and their
educational implications.

3.1 Assessment and Approaches to Learning 

Assessment is one of the defining features of the students’ approaches to 
learning (e.g. Marton & Säljö, 1997; Entwistle & Entwistle, 1991; Ramsden,
1997). In this part of the review, an attempt is made to gain insight into the 
relations between (perceived) assessment properties and students’ 
approaches to learning and studying. 

3.1.1 Approaches to Learning 

When students are asked for their perceptions about learning, mainly 
three approaches to learning occur: (1) the surface approach to learning, (2) 
the deep approach to learning, and (3) the strategic or achievement approach 
to learning. 

a. Surface approach to learning 

Surface approaches to learning describe an intention to complete the task 
with little personal engagement, seeing the work as an unwelcome external 
imposition. This intention is often associated with routine and unreflective 
memorisation and procedural problem solving, with restricted conceptual 
understanding being an inevitable outcome (Entwistle & Ramsden, 1983; 
Entwistle, McCune & Walker, 2001). The surface approach is related to 
lower quality outcomes (Trigwell & Prosser, 1991). 

b. Deep approach to learning 

Deep approaches to learning, in contrast, lead from an intention to 
understand, to active conceptual analysis and, if carried out thoroughly, 
generally result in a deep level of understanding (Entwistle & Ramsden, 
1983). This approach is related to high quality learning outcomes (Trigwell 
& Prosser, 1991). However, this deep approach is not necessarily always the 
“best” way, but it is the only way to understand learning materials 
(Entwistle et al., 2001).

c. Strategic or achieving approach to learning 

Several students refer in their perceptions on learning to the assessment 
procedures they experience. Because of the pervasive evidence of the 
influence of assessment on learning and studying an additional category was 
introduced, namely the strategic or achievement approach to learning, in 
which the student’s intention was to achieve the highest possible grades by 
using organized study methods and effective time management (Entwistle
& Ramsden, 1983). The strategic (or achieving) approach describes
well-organized and conscientious study methods linked to achievement
motivation: the determination to do well. The student relates studying to the
assessment requirements in a manipulative, even cynical, manner (Entwistle
et al., 2001). The following student’s comment illustrates this:

“I play the examination game. The examiners play it too. ... The technique
involves knowing what is going to be in the exam and how it’s going to be
marked. You can acquire these techniques from sitting in the lecturer’s class,
getting ideas from his point of view, the form of the notes, and the books he
has written - and this is separate to picking up the actual work content”
(Entwistle & Entwistle, 1991, p. 208). 

3.1.2 Assessment in Relation to Students’ Approaches and Vice Versa

The research on the relation between approaches to learning and 
assessment is dominated by the Swedish Research Group of Marton and
Säljö. These two researchers (Marton & Säljö, 1997) conducted a series of
studies in which they tried to influence the students’ approaches to learning
towards a deep approach to learning. A prerequisite for attempting to
influence how people act in learning situations is to have a clear grasp of
precisely how different people act: “What is it that a person using a deep
approach does differently from a person using a surface approach?” The
learner/reader, using a deep approach to learning, engages in a more active
dialogue with the text. One of the problems with a surface approach is the
lack of such an active and reflective attitude towards the text. Consequently,
an obvious idea was to attempt to induce a deep approach through giving
people some hints on how to go about learning (Marton & Säljö, 1997).

In his first study, Marton (1976) adopted the following procedure for 
influencing the approach to learning. In the experimental group, the students 
had to answer questions of a particular kind while reading a text. These 
questions were of the kind that students who use a deep approach had been 
found to ask themselves spontaneously during their reading. The design of 
this study included an immediate, as well as a delayed, retention test. This 
attempt to induce a deep approach through forcing people to answer 
questions found to be characteristic of such an approach, yielded interesting 
but counter-intuitive results. At one level, it was obvious that the approach
taken was influenced by the treatment to which the experimental group was
exposed. However, this influence was not towards a deep approach: instead,
it seemed to result in a rather extreme form of surface learning. The control
group, which had not been exposed to any attempts at influencing the
approach taken, performed significantly better. 

What happened was that the participants invented a way of answering the
interspersed questions without engaging in the kind of learning characteristic
of a deep approach. The task was transformed into a rather trivial and mechanical
kind of learning, lacking the reflective elements found to signify a deep 
approach. What allowed the participants to transform the learning in this 
way was obviously the predictability of the task. They knew that they would 
have to answer questions of this particular kind, and this allowed them to go 
through the text in a way that would make it possible to comply with the 
demands without actually going into detail about what is said. This process 
can be seen as a special case of the common human experience of 
transformation of means into ends. The outcome of this study raises 
interesting questions about the conditions for changing people’s approach to 
learning. The demand structure of the learning situation again proved to be 
an effective means of controlling the way in which people set about the 
learning task. Actually, it turned out to be too effective. The result was in 
reality the reverse of the original intention when setting up the experiment. 
The predictability of the demand structure played a central role in generating 
this paradoxical outcome (Marton & Säljö, 1997).

A second study (Säljö, 1975) followed. Forty university students were
divided into two groups. The factor varied was the nature of the questions
that the groups were asked after reading each of several chapters from an 
education textbook. One set of questions was designed to require a rather 
precise recollection of what was said in the text. In the second group, the 
questions were directed towards major lines of reasoning. After reading a 
final chapter, both groups were exposed to both kinds of questions and they 
were required to recall the text and summarise it in a few sentences. The 
results show that a clear majority of the participants reported that they 
attempted to adapt their learning to the demands implicit in the questions 
given after each successive chapter. The crucial idea of this study was that 
people would respond to the demands to which they were exposed. In the 
group that was given “factual” questions, this could be clearly seen. They 
reacted to the questioning through adopting a surface approach. However, in 
the other group, the reaction did not simply involve moving towards a deep 
approach. Some did, others did not. A fundamental reason underlying this 
was differing interpretations of what was demanded of them. Only about half 
the group interpreted the demands in the way intended. The other students 
“technified” their learning, again concentrating solely on perceived 
requirements. They could summarise, but could not demonstrate 
understanding (Marton & Säljö, 1997).

It is important to realise that the indicators of a deep approach, isolated in
the research, are symptoms of a rather fundamental attitude towards what it 
takes to learn from texts. What happened was that some students made it an 
end in itself to be able to give a summary of the text after each chapter. This 
is thus an example of the process of technification of learning resulting in 
poor performance. Both studies (Marton, 1976; Säljö, 1975) illustrate that
although in one sense it is easy to influence the approach people adopt when 
learning, in another sense, it appears very difficult. It is obviously quite easy 
to induce a surface approach; however, when attempting to induce a deep 
approach the difficulties seem quite profound. The explanation is in the 
interpretation (Marton & Säljö, 1997).

In a third study, Marton and Säljö (1997) asked students to recount how
they had been handling their learning task and how it appeared to them. The 
basic methodology was that students were asked to read an article, knowing 
they would be asked questions on it afterwards. Besides the questions about 
what they remembered of its content, students were also asked questions 
designed to discover how they tackled this task. All the efforts, readings and 
re-readings, iterations and reiterations, comparisons and groupings of the 
researchers finally turned into an astonishingly simple picture. The students 
who did not get “the point” (that is, they did not understand the text as a 
whole) failed to do so, simply because they were not looking for it. The main 
difference that was found in the process of learning concerned whether the 
students focused on the text itself or on what the text is about: the author’s 
intention, the main point, the conclusion to be drawn. In the latter case, the 
text is not considered as an aim in itself, but rather as a means of grasping 
something that is beyond or underlying it. It can be concluded that there was 
a very close relationship between process and outcome. The depth of 
processing was related to the quality of outcome in learning (Marton &
Säljö, 1997).

The students’ perceived assessment requirements seem to have a strong 
influence on the approach to learning a student adopts when tackling an 
academic task (Säljö, 1975; Marton & Säljö, 1997). Similar findings
emerged from the Lancaster investigation (Ramsden, 1981) in relation to a 
whole series of academic tasks and to students’ general attitudes towards 
studying. Students often explained surface approaches or negative attitudes 
in terms of their experiences of excessive workloads or inappropriate forms 
of assessment. The experience of learning is made less satisfactory by 
assessment methods that are perceived to be inappropriate. High
achievement in conventional terms may mask this dissatisfaction and hide 
the fact that students have not understood material they have learned as 
completely as they might appear to have done. Inappropriate assessment 
procedures encourage surface approaches; yet varying the assessment 
questions may not be enough to evoke fully deep approaches (Ramsden,
1997). 

Entwistle and Tait (1990) also found evidence for this relation between
students’ approaches to learning and their assessment preferences. They 
found that students who reported themselves as adopting surface approaches 
to learning preferred teaching and assessment procedures which supported 
that approach, whereas students reporting deep approaches preferred courses 
which were intellectually challenging and assessment procedures which 
allowed them to demonstrate their understanding. A direct consequence of 
this effect is that the ratings that students make of their lecturers will depend 
on the extent to which the lecturer’s style fits what individual students prefer 
(Entwistle & Tait, 1995).

3.1.3 Implications for Teaching and Assessment Practice 

Assessment and approaches to learning are strongly related. The 
(perceived) characteristics of assessment have a considerable impact on 
students’ approaches, and vice versa. These influences can be both positive 
and/or negative. The literature and research on students’ perceptions of
assessment in relation to the students’ approaches to learning, suggest that 
deep approaches to learning are encouraged by assessment methods and 
teaching practices which aim at deep learning and conceptual understanding, 
rather than by trying to discourage surface approaches to learning (Trigwell 
& Prosser, 1991). Therefore, lecturers and educational policy play an
important role in creating these “deep” learning environments. 

The next subsection about students’ perceptions of diverse assessment 
formats and methods can equip us with valuable ideas, interesting tip-offs 
and useful information to bring this deep learning and conceptual 
understanding into practice. 

3.2 Assessment Format and Methods 

During the last decade, an immense set of alternative assessments was
developed and implemented into educational practice as a result of new
insights and changing theories in the field of student learning. Students are
supposed to be “active, reflective, self-regulating learners”. Alternative
assessment practices must stimulate these activities, but do they? An attempt 
is made to answer this question from the students’ perspective. 

In this part of the review, we provide an answer to our second review 
question: “What are students’ perceptions about new modes of assessment?” 
Students’ perceptions about several novel assessment methods are examined 
and discussed. Research studies report on a variety of formats: portfolio 
assessment, self- and peer assessment, OverAll assessment, and simulations.
Additionally, a study by Kniveton (1996) compares students’ perceptions of
evaluation versus continuous assessment. Based on these reviewed studies,
some implications for teaching and assessment practice are given. 

3.2.1 Portfolio Assessment 

The overall goal of the preparation of a portfolio is for the learner to 
demonstrate and provide evidence that he or she has mastered a given set of 
learning objectives. Portfolios are more than thick folders containing student 
work. They are personalised, longitudinal representations of a student’s own
efforts and achievements. Students have to do more than memorise lecture 
notes and text materials because of the active creation process involved in 
preparing a portfolio. They must organise, synthesise and clearly describe 
their achievements and effectively communicate what they have learned. 
The primary benefit is that the integration of numerous facts to form broad 
and encompassing concepts is actively performed by the student instead of 
the instructor (Slater, 1996). Other reasons for using portfolios for 
assessment purposes include the impact that they have in driving student 
learning and their ability to measure outcomes such as professionalism 
(Friedman Ben-David, Davis, Harden, Howie, Ker, & Pippard, 2001). Slater 
(1996) gathered the findings on students’ perceptions of portfolios from 
several studies with first-year undergraduate physics students in the USA. 
Qualitative data were collected through formal interviews, focus group 
discussions, and open-ended written surveys. Most students interviewed and
surveyed report that, overall, they like this alternative procedure for
assessment. Portfolio assessment seems to reduce their perceived level of
“test anxiety”. This reduction shows up in the way students attend to class
discussions, relieved of their vigorous note-taking duties. Students thought
that they would remember what they were learning much better and longer 
than they do with the material for other classes they took, because they had 
internalised the material while working with it, thought about the principles, 
and applied physical science concepts creatively and extensively over the 
duration of the course. The most negative aspect of creating portfolios is that
students spend a lot of time going over the textbook or required readings.
Students report that they enjoy the time spent on creating portfolios and
that they believe it helps them learn physics concepts (Slater, 1996).

Boes and Wante (2001) also investigated student teachers’ perceptions of 
portfolios as an instrument for professional development, assessment and 
evaluation. Data were collected through portfolio analysis, observations,
informal interviews with staff and an open questionnaire for students. A 
sample of 48 student teachers in two Flemish institutions for teacher 
education was surveyed. The students felt portfolios stimulated them to
reflect and demonstrated their professional development as prospective
teachers. They felt engaged in the portfolio-creating process, but portfolio
construction in itself was not sufficient. Students thought that supervision, in
addition, was desirable and necessary. They saw portfolios as an instrument
for professional development and personal growth, but advantages were
especially seen in relation to evaluation. When students did not get grades
for their portfolios, much less effort was made to construct the portfolio.
Although portfolios are an important source of personal pride, students
thought portfolios were very time-consuming and expensive. Portfolios
appear very useful in learning environments in which instruction and 
evaluation form integrated parts (Boes & Wante, 2001). 

Meyer and Tusin (1999) also examined pre-service teachers’ pedagogical 
beliefs and their definitions of and experiences with portfolios. They 
investigated whether students’ pedagogical beliefs were related to their 
definitions of and experiences with portfolios. The students in this study are
familiar with portfolios, as an integral part of their elementary education 
program. Two types of portfolios are introduced: (1) student portfolios for 
assessment in the classroom, and (2) professional portfolios for evaluation of 
teachers. It is hypothesised that the students’ pedagogical beliefs and their 
method course experiences and field experiences are important and related 
influences on how pre-service teachers define and use portfolios. Whether 
teachers view portfolios as product or process might be an important 
influence on how they conceptualise and use portfolios. Pre-service teachers’ 
pedagogical beliefs are examined in terms of their achievement goals. 
During one semester, a sample of 20 elementary education majors was 
followed through methods courses into student teaching and their first year 
of classroom teaching. The sample consists of two groups of pre-service 
teachers: students completing their final methods coursework (education 
majors) and students completing their student teaching (student teachers). 
An informal survey about portfolios and a motivational survey designed for 
teachers, the Patterns of Adaptive Learning Survey (PALS), were conducted. 
The data were collected in two phases. The results indicated that beliefs 
about pedagogical practices appeared stable and did not differentiate the two 
groups although their levels of experience varied. Student teachers had more 
experience with portfolios personally and in field experiences prior to 
student teaching. Education majors reported more experience with portfolios 
in methods courses. Another result is that significant individual differences 
were found in how the students reported their beliefs about process- versus
product-oriented approaches to teaching. Three patterns among the pre-service
teachers’ self-reports of their pedagogical beliefs were found: (1) the
moderate perspective (n = 12), (2) the product/performance perspective (n =
4), and (3) the process perspective (n = 4). This study shows an influence of
students’ pedagogical beliefs and experiences on their definitions and use of
portfolios; however, the complexity of these interactions and each student’s
uniqueness were underestimated. It appeared that “what we thought we were 
teaching and modelling was not always what students were learning and 
perceiving” (Meyer & Tusin, 1999, p. 137). 

A critical and additional comment is given by Challis (2001) who argues
that portfolios have a distinct advantage over other assessment methods, as
long as they are judged within their own terms, and not by trying to make
them replicate other assessment processes. Portfolio assessment simply
needs to be seen in terms that recognise its own strengths and its differences
from other methods rather than as a replacement of any other assessment 
methods and procedures (Challis, 2001). 

3.2.2 Self- and Peer Assessment 

Self-assessment and peer assessment, as well as portfolio assessment and
Overall Assessment (see 4.2.5), are typical examples of alternative 
assessment methods in which the progressive perspectives of the 
constructivist movement are central. 

3.2.2.1 Self-Assessment

Orsmond and Merry (1997) implemented and evaluated a method of 
student self-assessment. The study concerns the importance of
understanding marking criteria in self-assessment. Pairs of first-year
undergraduate biology students were asked to complete a poster assignment 
on a specific aspect of nerve physiology. The study was designed to allow 
the evaluation of (1) student self versus tutor marking for individual marking 
criteria, and (2) student versus student marking of their poster work for 
individual marking criteria. In the first stage of the research, 105 students 
were informed that as a part of their practical work a scientific poster was to 
be produced in laboratory time. The overall theme was to be an aspect of
nerve physiology, but the students would have the choice of the specific
subject for their posters. The students were told that they had to work in 
pairs and they were given the precise date that the finished poster would be 
displayed. In a second stage, the students were given verbal instructions 
about the poster marking scale, and the self-marking procedure. In the third
stage, the students were given written instructions about what was required 
for the poster assessment. The written instructions supported the previous 
verbal instructions. During the fourth and last stage, the assessment exercise 
took place. The 105 students were asked to fill in an individual evaluation 
questionnaire, so that students’ feedback on the exercise could be obtained. 
A poster marking form was used by individual students to mark each poster
for five separate marking criteria. A space for students’ comments was
provided. An overall mark for each poster was obtained by adding the five
criterion values together. Once all the posters had been marked by the
students, the tutor marked the work. A comparison between the tutor and the
student self-assessed mark revealed an overall disagreement of 86%, with
56% of students over-marking and 30% under-marking. It is noticeable that
poor students tend to over-mark their work, whilst good students tend to
under-mark. If the individual criteria are considered, then the number of
students marking the same as the tutor ranged from 31% to 62%. The
agreement among students’ marks ranged from 51% to 65%. Students
acknowledged the value of this self-marking task. They thought that
self-assessment made them think more and felt that they learned more. Most of
the students reported that self-assessment made them more critical of their
work and they felt that they worked in a more structured way.
Self-assessment is perceived as challenging, helpful and beneficial. It is
concluded that marking is a subjective activity and having clear marking 
criteria that are known to both students and tutor allows the students to see 
how their marks have been obtained. It is far better to take the risk over 
marks than to deprive students of the opportunity of developing the 
important skills of making objective judgements about the quality of their 
own work (and that of their peers) and of generally enhancing their learning 
skills (Orsmond & Merry, 1997). 
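
As an illustration of how agreement figures of this kind can be derived, the comparison between self-assessed and tutor marks can be sketched in a few lines of Python. This is a minimal sketch and not the procedure used by Orsmond and Merry (1997): the function name `marking_agreement` and the sample marks are hypothetical, and "agreement" is assumed here to mean identical overall marks.

```python
def marking_agreement(student_marks, tutor_marks):
    """Return the percentage of students who over-mark, under-mark,
    or agree exactly with the tutor (hypothetical helper)."""
    if len(student_marks) != len(tutor_marks) or not student_marks:
        raise ValueError("need two non-empty mark lists of equal length")
    n = len(student_marks)
    pairs = list(zip(student_marks, tutor_marks))
    over = sum(1 for s, t in pairs if s > t)    # student mark above tutor mark
    under = sum(1 for s, t in pairs if s < t)   # student mark below tutor mark
    agree = n - over - under                    # identical marks
    return {
        "over": 100 * over / n,
        "under": 100 * under / n,
        "agree": 100 * agree / n,
        "disagree": 100 * (over + under) / n,
    }


if __name__ == "__main__":
    # Invented marks for four posters, for illustration only.
    print(marking_agreement([10, 8, 7, 9], [9, 8, 8, 7]))
```

In the study reported above, the over-marking and under-marking percentages (56% and 30%) together make up the overall disagreement of 86%.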

Mires, Friedman Ben-David, Preece and Smith (2001) undertook a pilot 
study to evaluate the feasibility and reliability of undergraduate medical 
student self-marking of degree written examinations, and to survey students’ 
opinions regarding the process. A paper consisting of four constructed 
response questions was administered to 119 second-year students who 
volunteered to take the test under examination conditions. These volunteers 
were asked to return for the self-marking session. Again under examination 
conditions, the 99 students who attended the self-marking session were 
given back their original unmarked examination scripts. The agreed correct 
responses were presented via an overhead projector and students were asked 
to mark their responses. There was no opportunity for discussion. Prior to 
leaving the session, students were asked to complete an evaluation form 
which asked them about the value of the exercise (3-point Likert scale), 
certainty of marking and advantages and disadvantages of the process. In 
contrast to the study of Orsmond and Merry (1997), a comparison between 
the students’ marks and the staff’s marks, for each question and the 
examination as a whole, revealed no significant differences. Student 
self-marking was demonstrated to be reliable and accurate. If student marks 
alone had been used to determine passes, the failure rate would have been 
almost identical to that derived from staff marks. The students in this study, 
however, failed to acknowledge the potential value of self-marking in terms 
of feedback and as a learning opportunity, and expressed uncertainty over 
their marks. Students perceived many more disadvantages than advantages 
in the self-marking exercise. Disadvantages included: finding the process 
stressful, feeling that they could not trust their own marking and having 
uncertainties on how to mark, being too concerned about passing/failing to 
learn from the exercise, worrying about being accused of cheating and hence 
having a tendency to under-mark, having the opportunity to “cheat”, finding 
the process tedious, considering it time consuming and feeling that the 
faculty were “offloading” responsibility. Advantages included the feeling of 
some students that it was useful to know where they had gone wrong and 
that feedback opportunity was useful (Mires et al., 2001). 

These two studies revealed interesting but quite opposite results. The 
different task conditions could serve as a plausible explanation. A first task 
condition that differs between the two studies is the clarity of the marking 
criteria. In the second study, for each question the agreed correct answer was 
presented, while in the first study, only general marking guidelines were given. 
These marking guidelines were not as specific and concrete as those provided by 
the correct answers. Another important task condition that differed was the 
level of stress experienced in the situation. In the first study, the task formed 
a part of the practical work the students had to produce during laboratory 
time. This is in strong contrast to the second study, in which the task was an 
examination. The level of stress in this situation was higher, because the 
evaluative consequences were more severe. Students’ primary concern was 
whether they failed or passed the examination. This stressful preoccupation 
with passing and failing is probably the reason why students could not 
acknowledge the potential value of the self-marking exercise for feedback 
purposes or as a learning opportunity. 

3.2.2.2 Peer Assessment 

Segers and Dochy (2001) gathered quantitative and qualitative data from 
a research project which focused on different quality aspects of two new 
assessment forms in problem-based learning: the Over All Test (see 4.2.5) 
and peer assessment. Problem-based learning intends to change the learning 
environment towards a student-centred approach, where knowledge is a tool 
for effective problem analysis and problem-solving, within a social context 
where discussion and critical analysis are central. In the Louvain case, peer 
assessment was introduced for students to report on collaborative work 
during the tutorial meeting, and during the study period that follows these 
weekly meetings. Pearson correlation values indicated that peer and tutor 
scores are significantly interrelated. The student self-scores are, to a minor 
extent, related to peer and tutor scores. These findings suggest that students 
experience difficulties in assessing themselves. Critical analysis of their own 
functioning seems to be more difficult than evaluating peers. A questionnaire 
was developed to measure students’ perceptions of the self- and peer 
assessment. The questionnaire was completed by a sample of 27 students. It was 
found that, on the one hand, students are positive about self- and peer 
assessment as stimulating deep-level thinking and learning, critical thinking, 
and structuring the learning process in the tutorial group. On the other hand, 
the students have mixed feelings about being capable of assessing each other 
in a fair way. Most of them do not feel comfortable in doing so (Segers & 
Dochy, 2001). 
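
The Pearson correlation referred to above expresses the strength of the linear relation between two sets of scores, for example peer scores and tutor scores for the same students. The following is a minimal sketch in Python under stated assumptions: the function name `pearson_r` and all score values are hypothetical and are not taken from Segers and Dochy (2001).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equally long
    lists of scores (hypothetical helper for illustration)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented scores for five students, for illustration only.
    peer_scores = [6, 7, 8, 5, 9]
    tutor_scores = [5, 7, 8, 6, 9]
    print(round(pearson_r(peer_scores, tutor_scores), 2))
```

Values close to 1 indicate that two groups of assessors rank students in much the same way, and values close to 0 indicate little linear relation. Whether a given correlation is statistically significant also depends on the sample size, which this sketch does not test.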

3.2.3 Over All Test 

In the Maastricht case of Segers and Dochy’s (2001) investigation, a 
written examination, namely the OverAll Test, was used to assess the extent 
to which students are able to define, analyse, and solve novel, authentic 
problems. It was found that the mean score on this test was between 30% 
and 36%, with a standard deviation from 11 to 15. This implies that the 
students master on average one-third of the learning goals measured. Staff 
perceived these results as problematic. Two topic checklists were used to 
assess the extent to which the OverAll Test measures the curriculum as 
planned (curriculum validity) and the curriculum as implemented in practice 
(instructional validity). The results suggest that there is an important degree 
of overlap between the formal and the operational curriculum in terms of 
concepts studied. Additionally, there is an acceptable congruence between 
the assessment practices in terms of goals assessed and the formal and 
operational curriculum. Thus, the OverAll Test seems to have a high 
instructional validity. Through the analysis of think-aloud protocols of 
students handling real-life problems, confirmatory empirical evidence of 
criterion validity was found. This type of validity refers to the question of 
whether a student’s performance on the OverAll Test has anything to do 
with professional problem-solving. For staff, the central question remained 
why students did not perform better on the OverAll Test. Therefore, 
students’ perceptions of the learning-assessment environment were 
investigated. A student evaluation questionnaire was administered to 100 
students. The students’ negative answer to the statement “the way of 
working in the tutorial group fits the way of questioning in the OverAll Test” 
particularly struck the staff as contradictory. Although empirical evidence of 
curriculum validity was found, students did not perceive a match between 
the processes in the tutorial group and the way of questioning in the OverAll 
Test. Staff regarded this perception as a serious issue, particularly because 
working on problems is the main process within problem-based learning 
environments. In order to gain more insight into these results, 
semi-structured interviews were conducted in four groups (n total = 33). The students 
indicated that the other assessment instruments of the curriculum mainly 
measured the reproduction of knowledge. Students felt that for the OverAll 
Test, they had to do more; they had to build knowledge instead of merely 
reproducing it. The tutorial group was perceived as not effectively preparing 
students for the skills they need for the OverAll Test. Too many times, 
working in the tutorial groups was perceived as running from one problem to 
another, without really discussing the analysis and the solution of the 
problem, based on what was found in the literature. The students also 
indicated that they had problems with the novelty of the problems. During 
the tutorials, new examples with slight variations to the starting problem are 
seldom discussed. The students suggested more profound discussions in the 
tutorial groups, and that analysing problems should be done in a more 
flexible way. In one of the modules, a novel case was structurally 
implemented and discussed in the tutorial groups based on a set of questions 
similar to the OverAll Test questions. Students valued this procedure, and 
felt the need to do this exercise in flexible problem analysis, structurally in 
all modules (Segers & Dochy, 2001). 

From both cases, the Louvain and the Maastricht case, it can be 
concluded that there is a mismatch between the formal learning environment 
as planned by the teachers and the actual learning environment as perceived 
by the students. Students’ perceptions of the learning-assessment 
environment, based on former learning experiences and their recent 
experiences, have an important influence on their learning strategies and 
affect the quality of their learning outcomes. Therefore, they are a valid 
input for understanding why promises are not fulfilled. Moreover, looking 
for students’ perceptions of the learning-assessment environment seems to 
be a valid method to show teachers ways to improve the 
learning-assessment environment (Segers & Dochy, 2001). 

3.2.4 Simulation 

Edelstein, Reid, Usatine and Wilkes (2000) conducted a study to assess 
how computer-based case simulations (CBX) and standardised patient 
exams (SPX) compare with each other and with traditional measures of 
medical students’ performance. Both SPX and CBX allow students to 
experience realistic problems and demonstrate the ability to make clinical 
judgements without the risk of harm to actual patients. The object of the 
study was to evaluate the experiences of an entire senior medical school 
class as they took both traditional standardised examinations and new 
performance examinations. In a quantitative study, 155 fourth-year students 
of the School of Medicine at the University of California were assigned two 
days of performance examinations. After completing the examinations, the 
students filled in a paper-and-pencil questionnaire on clinical skills. The 
examination scores were linked to the survey and correlated with archival 
student data, including traditional performance indicators. It was found that 
the CBX and the SPX had low to moderate statistically significant 
correlations with each other and with traditional measures of performance. 
Traditional measures inter-correlated at higher levels than with CBX or SPX. 
Students’ perceptions varied with the type of assessment. 
Students’ rankings of the relative merits of the examinations in assessing 
different physician attributes indicated that performance examinations 
measure different physician competency domains. Students individually and 
in subgroups do not perform the same on all tests, and they express 
sensitivity to the need for different purposes. The use of multiple evaluation 
tools allows finer gradations in individual assessment. A multidimensional 
approach to evaluation is the most prudent (Edelstein et al., 2000). 

3.2.5 Evaluation Versus Continuous Assessment 

In his study, Kniveton (1996) asked students what qualities they 
perceived in continuous assessment and examinations. The important 
question is not what students “like”, but what they feel are strengths and 
weaknesses of various types of assessment. Subjecting the student to an 
assessment procedure that the student can react to positively may well be an 
important contributor to a student’s success, and the use to which a particular 
assessment technique can be put will to some extent depend on the student’s 
perceptions of it. A questionnaire, with 47 questions of which 46 are 
answerable on a 9-point scale, concerning what students considered 
characteristics of the different types of assessment, was used. This 
instrument was administered to 292 undergraduates in human, environmental 
and social studies departments in 2 universities. It is the purpose of the 
research to examine and compare the perceptions of students taking a 
number of degrees, giving equal weight to the variables of age and gender. 
The overall view of the students was that continuous assessment should not 
be involved in much more than half of their grade measurement. Although 
assessment techniques are seen as fairer and measuring a range of abilities, 
this finding does not indicate an overwhelming endorsement of continuous 
assessment, nor does it indicate a total rejection of the idea of examinations. 
A number of sub-group differences were found. First, there are a 
number of aspects of assessment where there is an interaction between 
gender and age. Mature males and younger females tend to regard 
continuous assessment as having many advantages over examinations. 
Younger male and mature female students are far less positive about 
continuous assessment. At a second level, there are a number of aspects of 
assessment where mature male students more than other groups feel that 
aspects of continuous assessment are extremely positive. On average mature 
males want the most continuous assessment and younger males the least 
(Kniveton, 1996). 

3.2.6 General Perceptions About Assessment 

A series of studies do not focus on students’ perceptions about specific 
modes of assessment but more generally investigate students’ perceptions 
about assessment. The study of Drew (2001) illustrates students’ general 
perceptions about the value and purpose of assessment. Within the context of 
new modes of assessment, the Northumbria Assessment studies are often 
cited. In these studies, different aspects of students’ perceptions of new 
modes of assessment are elaborated upon: the consequential validity of 
alternative assessment and its (perceived) fairness, as well as the relations 
between teachers’ messages and students’ meanings in assessment, and the 
hidden curriculum. 

3.2.6.1 What Helps Students Learn and Develop in Education 

Drew (2001) describes the findings of a series of structured group 
sessions, which elicited students’ views on their learning outcomes, and 
what helped or hindered their development. 

The process of the amended session consisted of: (1) small sub-group 
discussions, (2) general discussions in the whole group, and (3) students’ 
individual views in writing. The amended session was run with 14 course 
groups at Sheffield Hallam University, with a total of 263 students. Each 
session generated an amount of qualitative data, in the form of 
student-generated flip charts and individually written views. The students’ comments 
about what helped or hindered the development of their learning outcomes 
are the focus of the researcher. The findings suggest that there are three areas 
(i.e. three contextual factors) that, together, comprise the context in which 
students learn, and which have a strong influence on how and if they learn: 
(1) course organisation, resources and facilities, (2) assessment, and (3) 
learning activities and teaching. Set within this context is the student and his 
use of that context (i.e. four student-centred factors), relating to (a) students’ 
self-management, (b) students’ motivation and needs, (c) students’ 
understanding and (d) students’ need for support. 

Drew (2001) found the following results on the four student-centred factors: 
(a) Students’ self-management. Autonomy and responsibility for their 
learning were themes emerging through students’ comments. Students 
acknowledged the importance of operating autonomously. They liked “to be 
treated like adults”. (b) Students’ motivation and needs. The students felt it 
was important for allowances to be made for their individual needs, but 
considered that lectures often assumed their needs were identical. Students 
thought it was dangerous to assume that all students on a course shared 
interests and aspirations. Subjects needed to be pitched at their level. (c) 
Students’ understanding. The students wanted to grasp principles and 
concepts, rather than detail, saw dangers in merely memorising information 
and thought that understanding the aims for a subject helped them to handle 
it. Students saw reflection as valuable and important for understanding. (d) 
Students’ need for support. Personal, but especially academic needs for 
support were mentioned. Students wanted it to reduce uncertainty and 
anxiety, and saw support as taking a variety of forms, for example: clear 
structures, guidance and personal contact (Drew, 2001). 

Within the context of “assessment”, the second contextual factor and the 
focus of this review, these student-centred factors occur as follows: students 
valued self-management and, generally, examinations were seen as less 
supportive of its development. Deadlines were not seen in themselves as 
unhelpful. They developed self-discipline, the ability to work under pressure 
and increased determination, but they were also seen as indicating when to 
work, rather than when work was to be completed. Assessment, seen by the 
students as a powerful motivator, was regarded as a major vehicle for 
learning. However, a heavy workload could affect the depth at which they 
studied and, in some courses, students thought it should be lessened so that 
“work doesn’t just wash over students”. In order to help them learn, students 
wanted to know what was expected: clear briefs and clear assessment 
criteria. Students closely linked the provision of feedback with support. 
Effective feedback was critical to “build self confidence, help us evaluate 
ourselves” and students wanted more of it. Students preferred 1:1 tutorials as 
a method to provide effective feedback, but they knew that staff pressures 
made this difficult. They disliked one-line comments and saw typed 
feedback sheets as excellent (Drew, 2001). 

3.2.6.2 But Is It Fair: Consequential Validity of Alternative 
Assessment 

Sambell, McDowell and Brown (1997) conducted a qualitative study of 
students’ interpretations, perceptions and behaviours when experiencing 
forms of alternative assessment, in particular its consequential validity (i.e. 
the effects of assessment on learning and teaching). The “Impact of 
Assessment” project employed the case study methodology. Data were 
gathered from thirteen case studies of alternative assessment in practice. The 
methods for collecting these data included interviewing both staff and 
students, observation and examination of documentary evidence, but the 
emphasis was on semi-structured (group) interviews with students. A staged 
approach to interviewing was used, so that respondents’ perceptions and 
approaches were explored over the period of the assessment, from the initial 
assessment briefings at the beginning of a unit of learning to 
post-assessment sessions. Initial analysis of the data was conducted at the level of 
the case, which resulted in summary case reports. Individual case analysis 
was followed by cross-case analysis (Sambell et al., 1997). 

3.2.6.2.1 Effects of Student Perceptions of Assessment on the Process 
of Learning 

Broadly speaking, it was discovered that students often reacted very 
negatively when they discussed what they regarded as “normal” or 
traditional assessment. One of the most commonly voiced complaints 
focused upon the perceived impact of traditional assessment on the quality of 
learning achieved. Many students expressed the opinion that normal 
assessment methods had a severely detrimental effect on the learning 
process. In their view, exams had little to do with the more challenging task 
of trying to make sense of and understand their subject. By contrast, when students 
considered new forms of assessment, their views of the educational worth of 
assessment changed, often quite dramatically. Alternative assessment was 
perceived to enable, rather than pollute, the quality of learning achieved. 
Many made the point that for alternative assessment they were channelling 
their efforts into trying to understand, rather than simply memorise or 
routinely document, the material being studied. Yet, although all the students 
interviewed felt that alternative assessment implied a high-quality level of 
learning, some recognised that there was a gap between their perceptions of 
the type of learning being demanded and their own action. Several claimed 
they simply did not have the time to invest in this level of learning and some 
freely admitted they did not have the personal motivation (Sambell et al., 
1997). 

3.2.6.2.2 Perceptions of Authenticity in Assessment 

Many students perceived traditional assessment tasks as arbitrary and 
irrelevant. This did not make for effective learning, because they only aimed 
to learn for the purposes of the particular assessment, with no intention of 
maintaining the knowledge for the long term. Normal assessment was seen 
as an unavoidable evil: something they had to endure, not because it was 
interesting or meaningful, but only because it allowed them to accrue marks. 
Normal assessment activities were described in terms of 
routine, dull, artificial behaviour. Traditional assessment was believed to be 
inappropriate as a measure, because it appeared simply to measure your 
memory or, in the case of essay-writing tasks, to measure your ability to 
marshal lists of facts and details. Students repeatedly voiced the belief that 
the example of alternative assessment under scrutiny was fairer than 
traditional assessment, because by contrast, it appeared to measure qualities, 
skills and competencies that would be valuable in contexts other than the 
immediate context of assessment. In some of the cases, the novelty of the 
assessment method lay in the lecturer’s attempt to produce an activity that 
would simulate a real life context, so students would clearly perceive the 
relevance of their academic work to broader situations outside academia. 
This strategy was effective and the students involved highly valued these 
more authentic ways of working. Alternative assessment enabled students to 
show the extent of their learning and allowed them to articulate more 
effectively and precisely what they had assimilated throughout the learning 
program (Sambell et al., 1997). 

3.2.6.2.3 Student Perceptions of the Fairness of Assessment 

The issue of fairness, from the student perspective, is a fundamental 
aspect of assessment, the crucial importance of which is often overlooked or 
oversimplified from the staff perspective. To students, the concept of 
fairness frequently embraces more than simply the possibility of cheating: it 
is an extremely complex and sophisticated concept that students use to 
articulate their perceptions of an assessment mechanism, and it relates 
closely to our notions of validity. Students repeatedly expressed the view 
that traditional assessment is an inaccurate measure of learning. Many made 
the point that end-point summative assessments, particularly examinations 
that took place only on one day, were actually considerably down to luck, 
rather than accurately assessing present performance. Often students 
expressed concern that it was too easy to leave out large portions of the 
course material, when writing essays or taking exams, and still do well in 
terms of marks. Many students felt unable to exercise any degree of control 
within the context of the assessment of their own learning. Assessment was 
done to them, rather than something in which they could play an active role. 
In some cases, students believed that what exams actually measured was the 
quality of their lecturer’s notes and handouts. Other reservations that 
students blanketed under the banner of “unfairness”, included whether you 
were fortunate enough to have a lot of practice in any particular assessment 
technique in comparison with your peers (Sambell et al., 1997). When 
discussing alternative assessment, many students believed that success more 
fairly depended on consistent application and hard work, not a last-minute 
burst of effort or sheer luck. Students use the concept of fairness to talk 
about whether, from their viewpoint, the assessment method in question 
rewards, that is, looks like it is going to attach marks to, the time and effort 
they have invested in what they perceive to be meaningful learning. 
Alternative assessment was considered fair because it was perceived as rewarding those 
who consistently make the effort to learn rather than those who rely on 
cramming or a last-minute effort. In addition, students often claimed that 
alternative assessment represents a marked improvement: firstly in terms of 
the quality of the feedback students expected to receive, and secondly, in 
terms of successfully communicating staff expectations. Many felt that 
openness and clarity were fundamental requirements of a fair and valid 
assessment system. There were some concerns about the reliability of self- 
and peer assessment, even though students valued the activity (Sambell et 
al., 1997). 

3.2.6.3 The Hidden Curriculum: Messages and Meanings in 
Assessment 

Sambell and McDowell (1998) focus upon the similarities and variations 
in students’ perspectives on assessment, based on two levels of data analysis. 
At the first level, the whole dataset was used to examine the alignment 
between the lecturers’ stated intentions for the innovation in assessment and 
the “messages” students received about what they should be learning and 
how they should go about it, in order to fulfil their perceptions of the new 
assessment requirements. This level revealed that, at surface levels, there 
was a clear match between statements made by staff and the “messages” 
received by students. Several themes emerged, indicating shifts in students’ 
characterizations of assessment. First, students consistently expressed views 
that the new assessment motivated them to work in different ways. Second, 
they felt that the new assessment was based upon a fundamentally different 
relationship between staff and students, and third, that the new assessment 
embodied a different view of the nature of learning. At the second stage of 
analysis, data were closely investigated on the level of the individual, to look 
for contradictory evidence, or ways in which, in practice, students expressed 
views of assessment which did not match these characterizations, and in 
which the surface-level close alignment of formal and hidden curriculum 
was disrupted in some way. It was found that students have their individual 
perspectives, all of which come together to produce many variants on a 
hidden curriculum. Students’ motivations and orientations to study influence 
the ways in which they perceive and act upon messages about assessment. 
Students’ views of the nature of academic learning influence the kinds of 
meaning they find in assessment tasks and whether they adopt an approach 
to learning likely to lead to understanding or go through the motions of 
changing their approach (Sambell & McDowell, 1998). Students’ 
characterizations of assessment, based on previous experience, especially in 
relation to conventional exams, also strongly influence their approach to 
different assessment methods. In an important sense, this research makes 
assessment problematical, because it suggests that students, as individuals, 
actively construct their own versions of the hidden curriculum from their 
experiences with and characterizations of assessment. This means that the 
outcomes of assessment as “lived” by students are never entirely predictable, 
and the quest for a “perfect” system of assessment is, in one sense, doomed 
from the outset (Sambell & McDowell, 1998). 

3.2.7 Implications for Teaching and Assessment Practice 

Previous educational research on students’ perceptions about 
conventional evaluation and assessment practices, namely multiple-choice 
and essay-type examinations, indicates that students perceive the 
multiple-choice format as more favourable than constructed response/essay items on 
the following dimensions: perceived difficulty, anxiety, complexity, success 
expectancy and feeling at ease (Zeidner, 1987). Within these groups of 
students, some remarkable differences are found. Students with good 
learning skills and students with low test anxiety rates both seem to favour 
the essay-type exams (Birenbaum & Feldman, 1998). This type of 
examination goes together with deeper approaches to learning than 
multiple-choice formats (Entwistle & Entwistle, 1991). 

When compared to alternative assessment, these perceptions about 
conventional assessment formats seem to contradict strongly the students’ 
more favourable perceptions towards alternative methods. Overall, learners 
think positively about new assessment strategies, such as portfolio assessment, 
peer assessment, simulations and continuous assessment methods. 

Although students acknowledge the advantages of these methods, some 
of the students’ comments put this overall positive image of alternative 
assessment methods into perspective. Different examination or task 
conditions can interfere. For example, a “reasonable” workload is a 
precondition of good studying and learning (Chambers, 1992). Sometimes, a 
mismatch was found between the formal curriculum as intended by the 
educator and the actual learning environment as perceived by the students. 
Furthermore, different assessment methods seem to assess various skills and 
competencies. It is important to value each assessment method within the 
learning environment for which it is intended, taking into consideration its 
purposes and the skills to be assessed, as well as the cost-benefit profile of 
each different mode. For example, is it appropriate to adapt the assessment 
automatically to (each of) the student’s preferences? Regarding your 
instruction and your assessment method as integrated parts, do they have the 
same or compatible purposes? How about the time investment for the 
students and/or the teacher’s time investment on this particular assessment 
type? 

In addition, methodological issues, like a poor operational 
implementation of the assessment mode or format, can give rise to biased 
results about students’ perceptions of several types of assessment. Any 
assessment done poorly will yield poor results. Therefore, it is important 
to consider the research methodological design when interpreting the 
findings, and certainly when assessing, evaluating and changing teaching 
practices. Further research is needed to verify and consolidate the results of 
these investigations. 

3.3 Effects of Perceptions about Assessment on Student 
Learning 

As we have already shown, students’ perceptions about assessment have
an important influence on their approaches to learning. However, is that
the only influence? We studied the effects of students’ perceptions
about assessment on their learning in order to answer our third and final
review question.

3.3.1 Test Anxiety 

Test anxiety can have severe consequences for the student’s learning
outcomes. In this section, the intrusive thoughts and concerns of students
with and without test anxiety are examined.

3.3.1.1 Nature of Test Anxiety 

Sarason (1984) analysed the nature of test anxiety and its relationships to
performance and cognitive interference from the standpoint of attentional
processes. The situations to which a person reacts with anxiety may be either
actual or perceived. The most adaptive response to stress is task-oriented
thinking, which directs the individual’s attention to the task at hand. The
task-oriented person is able to set aside unproductive worries and
preoccupations. The self-preoccupied person, on the other hand, becomes
absorbed in the implications and consequences of failure to meet situational
challenges. The anxious person’s negative self-appraisals are not only
unpleasant to experience, they also have undesirable effects on performance
because they are self-preoccupying and detract from task concentration.

Sarason (1984) conducted three studies concerning an instrument, Reactions
To Tests (RTT), designed to assess multiple components of a person’s
reactions to tests, to correlate those components with intellective
performance and cognitive interference, and to attempt to influence these
relationships experimentally. In the first study, a pool of items (the Test
Anxiety Scale, TAS) dealing with personal reactions to tests was
constructed and administered to 390 introductory psychology students. The
findings of this study indicate the existence of four discriminable
components of test anxiety: Tension, Worry, Test-Irrelevant Thinking, and
Bodily Reactions.

Based on these findings, a new instrument, the Reactions To Tests
questionnaire, was developed and administered to 385 psychology students.
This second study was conducted to obtain information about the scales’
psychometric properties and to determine their relationships to cognitive
interference. The subjects first filled in the RTT and the TAS, were then
given a difficult version of the Digit Symbol Test and, immediately
afterwards, responded to the Cognitive Interference Questionnaire (CIQ).
The Worry scale was negatively related to performance and positively
related to cognitive interference, which suggests that test anxiety is best
conceptualised in terms of worrisome, self-preoccupying thoughts that
interfere with task performance.

The third study compared groups that differ in the tendency to worry about
tests after they had received either (1) instructions directing them to
attend completely to the task on which they would perform, or (2) a
reassuring communication prior to performing the task. From a group of 612
students who responded to the RTT, 180 introductory psychology students
were selected for participation in the experiment. The findings show that
reassuring instructions have different effects for subjects who score high,
moderate and low on the Worry scale. “Worriers” in particular seem to
benefit from reassuring instructions given prior to the performance task,
whereas reassurance has a detrimental effect on students who score low on
the Worry scale. The latter may interpret the reassuring communication as a
signal to take the task too lightly, which might lower their motivation
and, as a consequence, their performance. The attention-directing condition
seems to have all the advantages that reassurance has for high Worry scale
scorers with none of the disadvantages. The performance levels of all
groups receiving these instructions were high. The attention-directing
instructions seemed to provide students with an applicable coping strategy.

The results of these studies suggest that, at least in evaluation
situations, anxiety is to a significant extent a problem of intrusive,
interfering thoughts that diminish attention to and efficient execution of
the task. Under neutral conditions, high and low test-anxious subjects
perform comparably. The studies showed that it is possible to influence
these thoughts experimentally. People who are prone to worry in evaluative
situations benefit simply from having their attention called to the
importance of maintaining a task focus. Reassurance, in the form of calming
statements geared to reduce the general feeling of upset that people
experience in threatening situations, can be counterproductive, especially
for students with low and moderate anxiety scores (Sarason, 1984).

3.3.1.2 Test Anxiety in Relation to Assessment Type and Academic 
Achievement 

The main objective of Zoller and Ben-Chaim (1988) was to study the
interaction between examination type, test anxiety and academic
achievement, in an attempt to reduce the test anxiety of college science
students through the use of the kinds of examinations they prefer and
thus, hopefully, to improve their performance. The State-Trait Anxiety
Inventory (STAI; Spielberger, Gorsuch, & Lushene, 1970) and the Type Of
Preferred Examinations (TOPE) questionnaire were administered to 83
college science students. In the latter questionnaire, students’
preferences and the reasons accompanying these preferences were assessed
for several traditional and non-traditional examinations. The most
preferred types of examinations are those in which the use of any
supporting material (i.e. notes, textbooks, tables) during the
examination is permitted and the duration of the exam is practically
unlimited, in particular: (1) a take home exam in which any material may
be used, and (2) a written exam in class with unlimited time and any
supporting material allowed. Students emphasise the importance of the
examination as a learning device that enhances understanding,
thoroughness and analysis, rather than superficial rote learning and
memorisation. As expected, students believe that, compared with the
conservative paper-and-pencil-type examinations, written examinations
with open books, either in class or at home, reduce tension and anxiety
and improve performance, and they therefore prefer them. Students also
claimed to have difficulty expressing themselves orally. Furthermore, it
is significant that most of the science students, regardless of their
year of study, strongly believe that the type of the final examination
crucially affects their final grade in the course. It also appeared that
students’ state anxiety is higher in the finals than in the midterms for
all four science courses. Finally, there is an important gender effect:
the state anxiety level of female science students in test situations
seems to be consistently higher than that of male science students.

If these findings are compared with a preliminary survey of the college
science professors on the issue of examinations, a remarkable result
emerges. Although teachers know precisely which types of examinations
students prefer, each professor persistently continues to give students
the one type of examination that he prefers, or considers most
appropriate for his needs, regardless of the students’ preferences.
Moreover, there is no tendency among the science professors to depart
from their “pat” examination type or to modify it even slightly (Zoller &
Ben-Chaim, 1988).

In a mini case study, Zoller and Ben-Chaim (1988) compared the
traditional class examination (written exam in class, time limited, no
supporting material allowed) with the non-traditional take home exam
(any material may be used) with respect to the interaction between
examination type, state (test) anxiety and academic performance. The
examination was divided into two equivalent parts, one administered in
class and the other as a take home exam a day apart. The State Anxiety
Inventory questionnaire was administered just before the start of each
exam. A negative correlation between test anxiety and academic
achievement was found: the lower the level of state anxiety, the higher
the students’ achievement, the difference being statistically significant.
In particular, the group of low achievers gained significantly more in
academic achievement in the “take home” exam than the group of high
achievers, and the level of state (test) anxiety of the low achievers
decreased considerably. The group of high achievers showed neither a gain
in achievement in the take home exam nor a significant change in their
state anxiety (Zoller & Ben-Chaim, 1988).

3.3.2 Student Counselling 

Student counselling is often claimed to be a potential method to cope
with high levels of distress. But is it? What are students’ perspectives?
Rickinson (1998) examined students’ perceptions of the distress they
experienced and of the effects of student counselling on this distress, and
related them to the students’ degree completion. The study explores
undergraduate students’ perceptions of the level of distress they experience
at two important transition points, first year entry and final year
completion, and the impact of this distress on their academic performance.
In addition, it discusses the effectiveness of counselling intervention in
ameliorating this distress and in improving students’ capacity to complete
their degree programs successfully. The relationship between undergraduate
student counselling and successful degree completion was investigated over
a four-year period.

First, the research examined the effectiveness of counselling intervention
at the first year transition point in relation to student retention and
subsequent completion. Students were categorised into risk groups according
to their level of commitment and risk of leaving. Of the 44 students
identified as “high risk”, only 15 accepted counselling intervention. At
their initial counselling intervention, all 15 students were assessed as
having significant difficulty with academic and social integration into the
university. All of them attended the full workshop program and reported
that the workshops had helped them to develop strategies for managing their
anxiety and for “settling in” socially and academically. Of the 15
students, 11 achieved an upper second class degree, three achieved a lower
second class degree and one achieved a third class honours degree.

Second, the study focused on final year students, investigating both the
impact of high levels of psychological distress on their academic
performance and the effectiveness of counselling intervention in relation to
degree completion. For final year students, a self-completion questionnaire
was chosen as the most practical method of assessing the perceived effect of
students’ problems on their academic performance both prior to and
following counselling intervention. The self-completion questionnaires
were administered to a selected sample of 43 undergraduates who used the
counselling service, together with the SCL-90-R, a psychometric
instrument. Of this sample, 30 students had self-referred and 13 were
referred via their tutor or doctor. Almost all students (n = 41) perceived
their academic performance as having been affected by their problems prior
to the counselling intervention. Following counselling, students recorded
their perception of the degree of change in their academic performance and
the degree to which they felt better able to deal with their problems. Of
the 43 students, 39 thought that their academic performance had improved
following counselling and 42 recorded that counselling had helped them to
deal more effectively with their problems. All 43 students completed their
degree programs successfully.

This study highlights the educational implications of high levels of
psychological distress for undergraduate students. The university learning
process, by providing the stimulus of new knowledge and experience,
challenges students’ existing level of development. To take full advantage
of this developmental opportunity, students need to tolerate the temporary
loss of balance. Counselling intervention was shown to be effective in
facilitating student retention and completion. Counselling helped students
at risk of leaving to adjust to the new social and academic demands of the
university environment. Subsequently, these students progressed to
successful degree completion. At the second transition point, the results
strongly suggest that counselling intervention was instrumental in reducing
the level of psychological distress of the final year students (Rickinson,
1998).

3.3.3 Cheating and Plagiarism 

Do students’ perceptions about cheating and plagiarism have important
consequences for students’ cheating behaviour and student learning? We
tried to find an answer.

Ashworth and Bannister (1997) conducted a qualitative study to discover
students’ perceptions of cheating and plagiarism in higher education. The
study tries to elicit how cheating and plagiarism appear from the perspective
of the student. Nineteen interviews were carried out as coursework toward
the end of a semester-long Master’s degree unit in qualitative research
interviewing. The work was undertaken by the course members, each of whom
interviewed one student and completed a full analysis and report on that one
interview.

Further analysis was done by the researchers. A first important result is
that there is a strong moral basis in students’ views on cheating and
plagiarism, which focus on values such as friendship, interpersonal trust
and good learning. Practices that have a detrimental effect on other
students are seen as particularly serious and reprehensible. The ethic of
peer loyalty is a dominant one. From the students’ perspective, the
“official” university view of cheating is not always appropriate: some
punishable behaviour can be regarded as justifiable and some officially
approved behaviour can be felt to be dubious. Another interesting finding is
that the notion of plagiarism is regarded as extremely unclear. Students are
unsure about precisely what should and should not be assigned to this
category. Doubt over what is “officially” permitted and what is punishable
appeared to have caused considerable anxiety. Some students fear that they
might plagiarise unwittingly in writing what they genuinely take to be their
own ideas, that is, that plagiarism might occur by accident. Conversely,
cheating that is extensive, intended and leads to substantial gain is seen
as the most serious. In this respect, examination cheating is seen as more
serious than coursework cheating. Finally, the study revealed that factors
such as alienation from the university due to lack of contact with staff,
the impact of large classes, and the greater emphasis on group learning are
perceived by students themselves as facilitating and sometimes excusing
cheating. For example, different forms of assessment offer different
opportunities for cheating. The informal context in which coursework
exercises are completed means there is ample scope to cheat through
collusion and plagiarism, in contrast to the controlled, invigilated
environment of unseen examinations. This study reveals the importance of
understanding the students’ perspective on cheating and plagiarism; this
knowledge can significantly assist academics in their efforts to
communicate appropriate norms. Without a basic commitment on the part of the
students to the academic life of the institution, there is no moral
constraint on cheating or plagiarism (Ashworth & Bannister, 1997).

Franklyn-Stokes and Newstead (1995) also conducted two studies on
undergraduate cheating. The first study was designed to assess staff and
students’ perceptions of the seriousness and frequency of different types of
cheating. Because of the sensitivity of the topic under investigation,
students were not asked to report their own cheating; instead they were
asked to estimate how frequently they thought cheating occurred in their
year group. A sample of 112 second-year students and 20 staff members were
administered a questionnaire, which asked them to rate the frequency and
seriousness of each of a set of cheating behaviours. An inverse relationship
between perceived frequency and seriousness of cheating behaviour was found:
the types of cheating behaviour rated as most serious were also rated as
the least frequent. Examination-related cheating behaviour was rated most
serious and least frequent, while coursework-related cheating behaviours
were rated least serious and most frequent. There were considerable
staff/student differences in the seriousness and frequency ratings. There
was no behaviour that students rated significantly more serious than did
staff. The differences for frequency were even more marked. Students rated
every type of behaviour as occurring more frequently than staff did, and
this difference was significant for 19 out of the 22 types of cheating
behaviours in the questionnaire. In addition, an important age effect was
found for students’ perceptions of cheating. The 25+ students rated
cheating as significantly more serious, and as occurring significantly less
frequently, than did their younger peers. There were no significant gender
differences.

In their second study, Franklyn-Stokes and Newstead (1995) used this set of
cheating behaviours to elicit undergraduates’ self-reports and their
reasons for indulging (or not) in each type of behaviour. The questionnaire
required subjects to say whether they had indulged in each type of behaviour
as an undergraduate. Then, they were asked to select a reason for indulging
(or not) in this type of cheating. Finally, in an open question, the
students were asked to give the main reason why they were studying for a
degree. The questionnaire was completed by 128 students from two science
departments in the same university. The overall occurrence of cheating
largely corroborated the findings from the first study regarding the
frequency of occurrence of each type of behaviour. There was no significant
gender effect. By contrast, the difference in reported cheating by age was
significant. The 18-20 year-olds reported an average cheating rate of 30%,
the 21-24 year-olds one of 36% and the 25+ students also reported an
average cheating rate of 30%. The reasons for cheating and for not cheating
varied to a considerable extent in relation to the type of behaviour. There
was no relationship between the reason students gave for studying for a
degree and the amount of cheating they admitted to.

These two studies suggest that more than half of the students are involved
in a range of cheating behaviours, including: allowing coursework to be
copied, paraphrasing without acknowledgement, altering and inventing data,
increasing marks when students mark their own work, copying another’s
work, fabricating references and plagiarism from text. Other important
results are that cheating occurs more in relation to coursework than with
examinations and that, although mature students perceive cheating as less
frequent and more serious, their self-reported frequency of occurrence was
the same as that for 18-20 year-olds. As to the reasons why students cheat,
the principal ones are time pressure and the desire to increase the mark.
The most common reasons for not cheating are that it was unnecessary or
that it would have been dishonest. Clearly, cheating may occur more
frequently than staff seem to be aware of, and it is not seen as seriously
by students as it is by staff (Franklyn-Stokes & Newstead, 1995).

3.3.4 Implications for Teaching and Assessment Practice 

Students’ perceptions about assessment seem to have an important
influence on student learning. Test anxiety and its accompanying intrusive
thoughts and concerns about possible consequences of the test have a
detrimental influence on students’ learning outcomes. Simple
attention-directing instructions from the teacher can equip the
test-anxious student with an appropriate coping strategy. In addition, the
assessment type can reduce the level of test anxiety. Furthermore, students
thought that counselling improved their academic performance and they felt
better able to deal with their problems. Students’ perceptions about
cheating and plagiarism do seem to have an influence on student learning.
For example, the higher the perceived seriousness of a cheating behaviour,
the lower its frequency, and the lower the perceived seriousness, the more
frequent the cheating behaviour. In this respect, students’ perceptions
have an important additional value when considering teaching and assessment
practices.



4. METHODOLOGICAL REFLECTIONS 

Traditionally, research on human learning was done from a first-order
perspective. This research emphasised the description of different
aspects of reality; reality per se. Research on students’ perceptions turned
the attention to the learner and certain aspects of his/her world. This
approach is directed not at reality as it is, but at how people view and
experience reality. It is called a second-order perspective. The accent of
this second-order perspective is on understanding and not on explanation
(Van Rossum & Schenk, 1984). Both qualitative and quantitative research has
been conducted to reveal this second-order perspective. Quantitative
research on students’ perceptions about assessment clearly predominates: 23
of the 36 studies were analysed solely by quantitative methods. Only 11
investigations, just under one third of the reviewed studies, were analysed
qualitatively, and two were analysed both quantitatively and qualitatively.

Very popular methods for data collection within the quantitative research
are Likert-type questionnaires (n = 35) and inventories (n = 7), for
example: the Reactions To Tests questionnaire (RTT) (Birenbaum & Feldman,
1998; Sarason, 1984), the Clinical Skill Survey (Edelstein et al., 2000),
the Assessment Preference Inventory (API) and the Motivated Learning
Strategies Questionnaire (MLSQ) (Birenbaum, 1997). Only a relatively small
number of surveys (n = 7) were done in response to a particular assessment
task, test or examination. Most other studies ask for students’ perceptions
in more general terms, not related to experiences with a specific
assessment task. The most frequently used methods for data collection within
the qualitative research were open questionnaires or written comments
(n = 4), think-aloud protocols (n = 1), semi-structured interviews
(n = 10), and focus group discussions (n = 4). Observations (n = 5) and
research of document sources (n = 7) were conducted to collect additional
information. The method of “phenomenography” (Marton, 1981) has been
frequently used to analyse the qualitative data gathered. Differences in
conceptualisation are systematically explored by a rigorous procedure in
which the transcripts are categorised into a relatively small number of
recognisably different categories, independently checked by another
researcher. This procedure strengthens the value of this qualitative
research, and allows connections to be made with quantitative studies
(Entwistle et al., 2001).

Most studies have a sample of 101 to 200 subjects (n = 11) or of 31 to 100
persons (n = 9). A relatively high number of studies (n = 6) have a sample
size of less than 30 students. Three investigations have a sample of 201 to
300 subjects and five have a sample of more than 300 persons. The sample
size of the two case studies (n = 13 cases) is unknown.
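
The counts reported in this section can be cross-checked with a few lines
of arithmetic. The Python sketch below is only an illustration and not part
of the reviewed research: the dictionary labels are shorthand chosen for
this example, and it assumes that the two case studies of unknown sample
size belong to the same set of 36 reviewed studies.

```python
# Counts as reported in the text; the labels are illustrative shorthand.
analysis_method = {"quantitative only": 23, "qualitative only": 11, "mixed": 2}

# Assumption: the two case studies (sample size unknown) are part of the 36.
sample_size = {
    "fewer than 30": 6,
    "31 to 100": 9,
    "101 to 200": 11,
    "201 to 300": 3,
    "more than 300": 5,
    "case studies (size unknown)": 2,
}

total_by_method = sum(analysis_method.values())
total_by_sample = sum(sample_size.values())


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


method_shares = {k: share(v, total_by_method) for k, v in analysis_method.items()}

if __name__ == "__main__":
    print("studies by analysis method:", total_by_method)
    print("studies by sample size:", total_by_sample)
    for label, pct in method_shares.items():
        print(f"{label}: {pct}%")
```

Under these assumptions both breakdowns sum to the same number of studies,
and the qualitative share falls just below one third, as stated in the text.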



5. OVERALL SUMMARY AND CONCLUSIONS 

Student learning is subject to a dynamic and richly complex array of
influences which are both direct and indirect, intentional and unintended
(Hounsell, 1997b). In this review, we set out to investigate students’
perceptions about assessment in higher education and their influences on
student learning and, more broadly, the learning-teaching environment. The
following questions were of special interest to this review: (1) what are
the influences of the (perceived) characteristics of assessment on students’
approaches to learning, and vice versa, (2) what are students’ perceptions
about different alternative assessment formats and methods, and (3) what are
the influences of these students’ perceptions about assessment on student
learning?

In short, this review showed that students’ perceptions about
assessment and its properties have considerable influences on students’
approaches to learning and, more generally, on student learning. Conversely,
students’ approaches to learning influence the ways in which students
perceive assessment.

Furthermore, it was found that students hold strong views about different
formats and methods of assessment. Educational research revealed that
within conventional assessment practices, students perceive the multiple
choice format as more favourable than constructed response/essay items.
Students prefer this examination format especially on the dimensions of
perceived difficulty, anxiety, complexity and success expectancy: they see
it as less difficult, less anxiety-provoking and less complex, and expect
to be more successful on it. Curiously, over the past few years, multiple
choice type tests have been the target of severe public and professional
attack on various grounds. Indeed, the attitude and semantic profile of
multiple choice exams emerging from the examinee’s perspective is largely
at variance with the unfavourable and negative profile of multiple choice
exams often emerging from some of the anti-test literature (Zeidner, 1987).
However, within the group of students some remarkable differences are
found. For example, students with good learning skills and students with
low test anxiety both seem to favour essay type exams, while students with
poor learning skills and high test anxiety have more unfavourable feelings
towards this assessment mode. It was also found that the essay type of
examination goes together with deep(er) approaches to learning than
multiple choice formats. Some studies found gender effects, with females
being less favourable towards multiple-choice formats than towards essay
examinations (Birenbaum & Feldman, 1998).

When students discuss alternative assessment, their perceptions of
conventional assessment formats contrast strongly with their more
favourable perceptions of alternative methods. Learners who experience
alternative assessment modes think positively about new assessment
strategies, such as portfolio assessment, self and peer assessment, and
simulations. From the students’ point of view, assessment has a positive
effect on their learning and is fair when it (Sambell et al., 1997):

• Relates to authentic tasks. 

• Represents reasonable demands. 

• Encourages students to apply knowledge to realistic contexts. 

• Emphasises the need to develop a range of skills.

• Is perceived to have long-term benefits.

Furthermore, different assessment methods seem to assess different skills
and competencies. The goal of the assessment thus has a lot to do with the
type of assessment and the consequent impact on students’ perceptions. It is
important to value each assessment method within the learning environment
for which it is intended, taking into consideration its purposes and the
skills to be assessed. It is not desirable to apply new or popular
assessment modes without reflecting upon the characteristics, purposes and
criteria of the assessment, and without considering the learning-teaching
environment, of which the assessment type is only one part shaping and
modelling student perceptions. Other influences, such as characteristics of
the student (e.g. the student’s motivation, anxiety level, approach to
learning, intelligence, social skills, and former educational experiences)
and properties of the learning-teaching environment (e.g. characteristics of
the educator, the teaching method used, the resources available), have to be
included.

The literature and research on students’ perceptions about assessment is
relatively limited. Apart from the well-known relational and
semi-experimental studies on students’ approaches to learning and studying
in relation to their expectations, preferences and attitudes towards
assessment, research on students’ perceptions about particular modes of
assessment is especially scarce. Most results are consistent with the
overall tendencies and conclusions. However, some inconsistencies and even
contradictory results are revealed within this review. Further research can
elucidate these results and can provide us with additional information and
evidence on particular modes of assessment in order to gain more insight
into the process of student learning. These findings can equip us with
valuable information in trying to comply with the more “benign” approach and
the pressures that it places in trying to maintain a truly (versus only
“perceived”) and more valid system of assessment. Many of the research
findings and possible solutions to assessment problems are good ideas, but
they have to be applied with great care and knowledge of assessment in its
full complexity. In this regard, it is important to view the first-order
perspective and the second-order approach to the study of human learning and
assessment as complementary. Ultimately, it is the interaction between the
two perspectives that leads to the understanding of the assessment
phenomenon.

This review has tried, and hopefully succeeded, to provide educators with
an important source of inspiration, namely students’ perceptions about
assessment and their influences on student learning, which can guide them in
their reflective search to improve their teaching and assessment practices
and, as a consequence, to achieve a higher quality of education.



1. INTRODUCTION

Falchikov and her colleagues (Falchikov, 1995; Falchikov & Boud, 1989;
Falchikov & Goldfinch, 2000) differentiated between self-assessment and
peer assessment and illustrated that student involvement in assessment
typically requires them to use their own criteria and standards to make their
judgments. Falchikov and Goldfinch maintained that student assessment is a
clear manifestation of instruction set up according to the principles of social 
constructivism. This new form of instruction requires students to learn from 
and with each other. A marked advantage of this socially situated form of 
instruction is that it naturally elicits peer assessment and self-assessment 
through reflection and self-reflection, even in the absence of marking and 
grading. It is encouraging that both meta-analytic studies on assessment in 
higher education, namely the one conducted by Palchikov and Boud on self- 
assessment and the more recent review conducted by Palchikov and 
Goldfinch on peer assessment, confirmed that students' assessments are more 
accurate when the criteria for judgement are explicit and well understood. 
This finding does not come as a surprise since self-assessment and peer 
assessment are new skills that students must acquire and learn to use in the 
context of skill acquisition. On comparing the outcomes of the two meta- 
analyses, Palchikov and Goldfinch came up with an intriguing difference. 
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They revealed that in the self-assessment study the students' ratings in high 
level courses were more similar to their teachers' than in the low level 
courses. Also, student assessors showed more agreement with their teachers 
in the area of science. Neither course-level differences nor subject area 
differences were found in the peer-assessment study. Palchikov and 
Goldfinch were surprised that senior students who are supposed to have a 
better understanding of the criteria by which they judge performance in a 
domain and also had more practice in peer assessing did not outperform their 
juniors. They suggested that the lack of differentiation between beginning 
and more advanced students is due to the public nature of peer assessment. 
Assessing one's own performance is usually done in private using one's own 
internal standards. This may be a more difficult task to do than comparing 
the public performance of one's peers and ranking their performance or skill 
acquisition process in ascending or descending order. 

We think that this finding offers two lessons to theorists on assessment. 
First, the results indicate that socially situated assessment (i.e., 
assessment that takes place in the context of the peer group), whether it is 
assessment of one's own performance or assessment of a group member's 
performance, is totally different from self-assessment in relation to 
individual work. Assessment done in public implies that social expectations 
and social comparisons contribute significantly to one's judgement. For this 
reason it is highly important that assessment researchers take care not to 
lump together these four different forms of assessment (self-assessment and 
peer assessment of individual work and self-assessment and peer assessment 
of collaborative work). 

The second lesson that assessment researchers should draw from these 
findings is that motivation factors are powerfully present in any form of 
assessment and bias the students' judgement of their own or somebody else's 
performance. In a recent study on the impact of affect on self-assessment, 
Boekaerts (2002, in press) showed that students' appraisal of the 
demand-capacity ratio of a mathematics task, before starting on the task, contributed 
a large proportion of the variance explained in their self-assessment at task 
completion. Interestingly, the students’ affect (experienced positive and 
negative emotions during the math task) mediated this effect. Students who 
experienced intense negative emotions during the task underrated their 
performance while students who experienced positive emotions, even in 
addition to negative emotions, overrated their performance. This finding is in 
line with much research in mainstream psychology that has demonstrated the 
effect of positive and negative mood state on performance and decision- 
making. 

On scanning the literature on assessment, we were surprised that most of 
the reported studies are concerned with peer assessment and 
self-assessment in relation to marking and grading. For example, in the literature on higher 
education, students who follow courses in widely different subject areas are 
typically asked to complete different instruments to assess diverse aspects of 
performance, including interpersonal skills and professional practice (e.g., 
poster and oral presentation skills, class participation, global peer 
performance, global traits displayed in group discussion or dyadic 
interaction, practical tests and examinations, videotaped interviews, tutorial 
problems, counselling skills, critiquing skills, internship performance, 
simulation training, ward assignments, group processes, clinical 
performance, laboratory reports). We did not come across any study that 
asked students to assess their own or their peers' interest in skill acquisition 
or professional practice. Nevertheless, we are of the opinion that students' 
interest in skill acquisition biases their self-assessment and peer assessment, 
and therefore endangers the validity and reliability of the assessment 
procedure. In order to investigate this claim, it is important that instruments 
become available that provide a window on the students' interest in the skill 
acquisition process. 

Why is it important to assess students’ interest in relation to skill 
acquisition within a domain? The results from a wide range of recent studies 
show that interest has a powerful, positive effect on performance (for a 
review, see Hoffman, Krapp, Renninger, & Baumert, 1998; Schiefele, 2001). 
This positive effect has been demonstrated across domains, individuals, and 
subject-matter areas. Moreover, interest has a profound effect on the quality 
of the learning process. Hidi (1990) documented that students who are high 
on interest do not necessarily spend more time on tasks, but the quality of 
their attentional and retrieval processes, as well as of their interaction with the 
learning material, is superior to that of students low on interest. They use 
less surface level processing, such as rehearsal, and more deep level 
processing, such as elaboration and reflection. In other words, interest is a 
significant factor affecting the quality of performance and should therefore 
be considered when interpreting students’ outcomes, namely their 
self-assessment and peer assessment. In light of these results, it is indeed 
surprising that the literature on assessment and on the qualities of new 
modes of assessment mainly focuses on the assessment of performances and 
does not take account of the students’ assessment of the underlying factors 
of performance, such as personal interest in skill development and 
satisfaction of basic psychological needs. Nevertheless, it is clear that 
instructional situations that present students with learning activities that satisfy 
their basic psychological needs create the conditions for interest to develop. 
Students can give valuable information on the factors underlying their 
interest, and as such help teachers to create more powerful learning 
environments. 

2. ASSESSING THE STUDENTS' INTEREST IN 
SKILL ACQUISITION 



2.1 Three Basic Psychological Needs: Autonomy, 
Relatedness, and Competence 

Several researchers, amongst others Deci and Ryan (1985), Deci, 
Vallerand, Pelletier, and Ryan (1991), Ryan and Deci (2000) and Connell 
and Wellborn (1991) provided evidence that specific factors in social 
contexts also produce variability in student motivation. They theorised that 
learning in the classroom is an interpersonal event that is conducive to 
feelings of social relatedness, competence, and autonomy. On the basis of 
their extensive research, Deci and Ryan argued that students have three basic 
psychological needs: they want to feel competent during action, to have a 
sense of autonomy, and to feel secure and to have established 
satisfying relationships. Deci and his colleagues further argued that intrinsic 
motivation, which is a necessary condition for self-regulation to develop, is 
facilitated when teachers support rather than thwart their students’ 
psychological needs. More concretely, they predicted that intrinsic 
motivation develops when teachers provide optimal challenges for their 
students, give effectance-promoting feedback, encourage positive interactions 
between peers, and keep the classroom free from demeaning evaluations. 

Deci and Ryan’s influential work provided insights into the reasons 
behind students’ task engagement in the classroom. They linked students’ 
satisfaction of basic psychological needs to their engagement patterns, 
locating regulatory styles along a continuum ranging from amotivation, or 
students’ unwillingness to cooperate, to external regulation, introjection, 
identification, integration, and ultimately active personal commitment or 
intrinsic motivation. Evidence to date suggests that students do not progress 
through each stage of internalisation to become self-regulated learners within 
a particular domain. Prior experiences and situational factors influence this 
process. As far as the situational factors are concerned, Ryan and his 
colleagues (e.g., Ryan, Stiller, and Lynch, 1994) showed that students need 
to be respected, valued, and cared for in the classroom in order to be willing 
to accept school-related behavioural regulation (see also Battistich, 
Solomon, Watson, & Schaps, 1997). They also want to satisfy their need for 
social relatedness, and their need to feel self-determined. Williams and 
Deci’s (1996) longitudinal study showed that in order to become self- 
regulated learners, in the true sense of the word, students need teachers who 
are supportive of their competency development and their sense of 
autonomy. Deci, Eghrari, Patrick, and Leone (1994) clearly showed that 
autonomy support is crucial for self-regulation to develop (i.e., students must 
grasp the meaning and worth of the learning goal, endorsing it fully and 
using it to steer and direct their behaviour). It is important to note in this 
respect that students who work in a controlling context may also show 
internalisation, provided the social context supports their competency 
development and social relatedness. However, under these conditions, 
internalisation is characterised by a focus on approval from self or others 
(introjection). 

2.2 Assessing Feelings of Autonomy, Competence, and 
Relatedness On-line 

Despite this interesting research, instruments that help teachers to gain 
insights into the interplay between students’ developing self-regulation, on 
the one hand, and their need for competence, autonomy, and social 
relatedness, on the other, are still rare. Yet, such instruments are essential to 
help teachers create a learning environment that is conducive to deep 
learning in successive stages of a course and to the development of self- 
regulation. We reasoned that it is crucial that students are invited to set their 
own goals and to direct their learning towards the realisation of these goals, 
yet perceive that the teacher is supportive of their autonomy, competence 
development, and social relatedness. This is particularly true for students 
who are working in cooperative learning environments with the teacher as a 
coach. By implication, it is important that teachers gain insight into how 
their students interpret the learning environment. This information will help 
them to increase or decrease task demands, external regulation (or 
scaffolding), and social (in)dependence in a flexible way. Ideally, teachers 
should develop antennae to pick up such signals. In order to help teachers to 
grow these antennae, we constructed an instrument that registers how 
individual group members value the learning environment in terms of the 
autonomy it grants, in terms of the perceived feeling of belonging, and in 
terms of their competency development. We reasoned that students who are 
working on self-chosen group projects for several weeks: 

1. are aware of their feelings of autonomy, competence and social relatedness, 
2. can report on these feelings, and 
3. can use these feelings as a source of information for determining how 
   interested they are in the group project. 

We predicted that feelings of competence, autonomy, and relatedness 
fluctuate during the course of a group project and have a strong impact on 
reported personal interest during successive stages of the project. We also 
reasoned that positive and negative perceptions of constraints and 
affordances at any point in time interact and jointly determine whether 
students appraise the current learning opportunity as optimal or sub-optimal 
for group learning. During the course of our work with students in higher 
education, we had observed many times that undergraduates react differently 
to external regulation, social support, and scaffolding in the various stages of 
a learning trajectory. It is easy to imagine that the extent to which students 
perceive that their fluctuating psychological needs are fulfilled has an impact 
on their interest in the project, implying that interest also fluctuates over 
time. Our position is that students who are working in a learning 
environment that they perceive as "optimal" are willing to invest resources to 
self-regulate their learning. By contrast, students who perceive the learning 
context as "sub-optimal" (e.g., not enough structure, no autonomy support) 
decline the teacher’s offer to coach their self-regulation process, mainly 
because they feel a lack of purpose (no goal-oriented behaviour), low 
relatedness, and no inclination to engage in learning tasks set by the teacher 
or by the group. Most teachers refer to this feeling state as low personal 
interest in the task or project. 

2.3 Constructing the First Version of the Quality of 
Working in Group Instrument 

The focus in this paper is on the construction of the paper-and-pencil 
version of an instrument that assesses students’ feelings of autonomy, 
competence, and relatedness on-line during successive sessions of working 
on a group project. In order to test the hypothesis that feelings of autonomy, 
competence and social relatedness fluctuate during the course of a group 
project and have a strong impact on personal interest, we needed an 
instrument that captures the fulfilment of these basic needs on-line. 
Basically, there are three sampling options: signal-contingent methods, 
event-contingent methods, and interval-contingent methods. After careful 
consideration of the alternatives, it was decided to opt for event-contingent 
sampling. 

The paper and pencil version of the Quality of Working in Group 
Instrument (QWIGI) was constructed after examining relevant instruments 
and several try-outs in secondary vocational education and in higher 
education. QWIGI is a simple instrument that consists of a number of self- 
report items that can be answered on Likert scales. Completing the 
questionnaire requires that students stop and think about the quality of the 
group learning process as they currently perceive it, starting with the 
particular feature that is highlighted in the item. Based on observations in the 
college classroom, we predicted that students’ sense of autonomy (feeling 
free to initiate and regulate their own actions) during group work is 
intricately linked to their understanding of how to solve a problem or 
complete an assignment and being self-efficacious in performing the 
necessary and sufficient actions (competence) as well as to their ability to 
establish satisfying connections with members of the group (relatedness). 
The relation between students’ perception of competence and social 
relatedness is less clear. A second prediction pertains to personal interest. It 
was hypothesised that perceived autonomy, competence, and social 
relatedness jointly influence students’ assessment of personal interest. A 
third prediction concerns the impact that these three predictors have, over 
time, on the assessment of personal interest. In line with our observations in 
the college classroom, it was predicted that the degree of personal interest 
that students express in a group assignment could best be explained at the 
start of the project and just before finishing the project. 



3. RESEARCH METHOD 



3.1 Subjects 

Participants were 54 undergraduate students who participated in a course 
in Educational Psychology that lasted several weeks and was taught 
according to the principles of social constructivism. The vast majority of the 
students were female. Students worked in nine self-selected groups of 5 or 6 
students. Data from 4 students with incomplete responses were excluded from 
all analyses. 

3.2 Instrument 

The Quality of Working in Groups Instrument consists of a single printed 
sheet on which 10 bipolar items are presented. Together, these items assess 
the students’ psychological needs: feelings of autonomy (2), social 
relatedness (2), and competence (2). In addition, their interest in the group 
project is assessed (2) as well as the degree of responsibility for learning (2). 
It was decided to include the latter two items because we thought that 
students’ need to develop secure and satisfying relationships with peers is 
empirically distinct from the degree to which they feel personally 
responsible for promoting group learning. Each item consists of a five-point 
bipolar Likert scale with two opposing statements located at either end of 
the scale. The statements were constructed on the basis of discussions and 
interviews with similar groups of students in previous years. An example of 
an item intended to measure students’ feelings of autonomy is: 

    There is plenty of room for      o o o o o      There is no room for making 
    making our own decisions                        our own decisions 

The exact wording of the items can be found in the appendix. Students 
who indicated high agreement with a positive statement (i.e. high feelings of 
autonomy, competence, social relatedness, and interest, respectively) 
received a score of 5 whereas high agreement with the negative statements 
received a score of 1. 
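
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item numbers assigned to each scale are hypothetical (the actual items are listed in the appendix), it is assumed that the positive statement may appear on either side of the sheet, and summing the two item scores per scale (giving a range of 2 to 10) is an assumption based on the scale maxima of 10.00 that appear in Table 1.

```python
# Hypothetical item-to-scale assignment; the real item order is in the appendix.
SCALES = {
    "autonomy": (1, 2),
    "competence": (3, 4),
    "social_relatedness": (5, 6),
    "interest": (7, 8),
    "responsibility": (9, 10),
}


def score_item(position: int, positive_on_left: bool) -> int:
    """Convert a ticked circle (1 = leftmost, 5 = rightmost) into a 1-5 score.

    High agreement with the positive statement scores 5 and high agreement
    with the negative statement scores 1, as described in the text.
    """
    if not 1 <= position <= 5:
        raise ValueError("position must be between 1 and 5")
    return 6 - position if positive_on_left else position


def score_sheet(positions: dict, positive_on_left: dict) -> dict:
    """Sum the two item scores of each scale (assumed scoring, range 2-10)."""
    return {
        scale: sum(score_item(positions[i], positive_on_left[i]) for i in items)
        for scale, items in SCALES.items()
    }


# Example sheet (invented): all positive statements printed on the left.
positive_on_left = {i: True for i in range(1, 11)}
positions = {1: 1, 2: 2, 3: 3, 4: 3, 5: 1, 6: 1, 7: 2, 8: 1, 9: 4, 10: 5}
scores = score_sheet(positions, positive_on_left)
print(scores)
```

Keeping the side of the positive statement as an explicit argument avoids hard-coding a layout that the text does not specify for every item.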

3.3 Procedure 

The course lasted 12 weeks and was split up into two consecutive units. 
The first unit involved five three-hour sessions that each started with direct 
teaching followed by group work. Students had to prepare for class and used 
this material when working in self-chosen groups of five or six students. 
They worked on parallel or complementary assignments and performed one 
of the rotating roles (chairperson, written report secretary, verbal report 
secretary, resource manager, ordinary member). At the end of each session, 
the verbal report secretary presented the group solution in public and the 
teacher invited all students to decontextualise the presented solutions. The 
first unit was completed with a written exam. The second unit of the course 
also consisted of five three-hour sessions. Unlike in the first unit, students 
did not have to take an exam but had to write a group paper that would result 
in a group mark. They were told that the group paper should focus on a 
specific self-regulatory skill that they wanted to improve in primary school 
students. In order to build up the competence of all the group members, 
students had to read specific articles for each session, visit primary schools 
and observe relevant classroom sessions, and enter into dialogue with group 
members in order to construct their own opinion of the merit of the 
intervention modules they had read about. All groups were free to select the 
domain (math, reading, writing, etc.) as well as the types of metacognitive or 
motivation skill(s) they wanted to improve. They had to set their own goals 
and monitor their progress towards these goals. In order to help them structure and 
organise the group activities and the preparation for the paper, teacher- 
guided discussions were organised at the beginning of each three-hour 
session. The literature covered for that session was discussed and the teacher 
also provided a framework for interpreting this information and for linking it 
to the topics discussed in previous sessions. For the remaining time of the 
three-hour sessions, students worked on their own project and the teacher 
provided feedback on the group activities and on the preliminary table of 
contents for the paper. The QWIGI was handed to the resource managers 
before students started on their group project and they were asked to hand it 
to the group members after about one hour into the group work. The 
resource manager collected the completed questionnaires and dropped them 
in a box that was located at the exit of the room. Students were highly 
compliant with the request to complete the questionnaires. 

3.4 Assessing the Reliability and Validity of QWIGI 

Falchikov and Goldfinch (2000) were concerned with the reliability and 
validity of the assessment instruments used in the studies that were included 
in their review. They explained that high peer-faculty agreement is the best 
indicator of "validity" of an instrument, whereas high agreement between 
peer ratings is the best indicator of the "reliability" of an assessment 
instrument. As previously mentioned, the assessment instruments reviewed 
in Falchikov and Goldfinch’s meta-analysis mainly concern marking and 
grading. Clearly, marks and grades used in educational settings are neither 
very reliable nor very valid indicators of achievement, even when there is 
reasonable agreement between various raters. It is generally accepted that 
multiple ratings are superior to single ones (Pagot, 1991) because the ratio of 
true score variance to error variance is increased. Likewise, when group 
members are asked to judge the performance of each participating student, 
the reliability of the average scores increases with the number of raters, 
but group size should be kept small to avoid the social-loafing effect 
(Latane, Williams & Hawkins, 1979). Contrary to the assessment 
instruments reviewed in Falchikov and Goldfinch’s study, our instrument 
does not deal with peer assessment or self-assessment of performance. 
Rather, it concerns peer assessment and self-assessment of the quality of the 
learning process within a collaborative learning context and the impact 
these perceptions have on personal interest in the project. It is our basic tenet 
that students’ profiles (i.e. their score on perceived autonomy, competence, 
and relatedness) are of crucial importance with respect to their assessment of 
personal interest in skill acquisition. 
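
The remark above that the reliability of averaged ratings increases with the number of raters can be illustrated with the Spearman-Brown formula. This is a minimal sketch, not a calculation taken from the study: it assumes that raters are interchangeable (parallel), and the single-rater reliability of .40 is an invented value.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Reliability of the mean of n_raters parallel ratings.

    Spearman-Brown formula: r_k = k * r / (1 + (k - 1) * r).
    """
    r = single_rater_reliability
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if n_raters < 1:
        raise ValueError("at least one rater is required")
    return n_raters * r / (1 + (n_raters - 1) * r)


# Invented single-rater reliability, for group sizes of 1 to 6 raters.
for k in range(1, 7):
    print(k, round(spearman_brown(0.40, k), 2))
```

Under these assumptions the gain from each additional rater shrinks quickly, which fits the advice to keep groups small.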

Contrary to what is accepted practice concerning the validity of 
performance assessment, faculty ratings are not more accurate than student 
ratings in relation to the assessment of personal interest. On the contrary, 
students are better judges of their personal interest than faculty members are. 
To evaluate the construct validity of the items at the different measurement 
points, confirmatory factor analyses with LISREL (Jöreskog & Sörbom, 
1993) were performed. The hypothesis that perceived autonomy, 
competence, and social relatedness jointly impact on students’ personal 
interest was tested, and the impact of these predictors over time was 
estimated, with multiple regression analyses. 

Neither classical item analysis nor high agreement between peer ratings is 
considered an appropriate estimate of the reliability of the QWIGI 
scores. Classical item analysis is 
considered inappropriate due to the restricted number of items per scale 
(namely 2). High agreement between peer ratings is regarded as an 
inappropriate indicator of reliability due to the presupposed process-related 
fluctuations in students’ psychological needs during the course of a group 
project. As mentioned previously, we wanted to link students’ profiles on 
perceived autonomy, competence, and relatedness to their developing 
interest in the group project. We therefore decided to calculate profile 
reliability for all the scales, using Lienert and Raatz’s (1994, p. 324) 
formula. These researchers established that profile reliability is stronger 
when the reliability of the separate scales is high and the inter-correlations 
between the scales are low. Lienert and Raatz mentioned a correlation 
coefficient of .50 as the lower limit of sufficiency. 
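
The source does not reproduce the formula itself. The sketch below uses one commonly cited form of a profile reliability coefficient, (mean scale reliability - mean inter-correlation) / (1 - mean inter-correlation), which behaves as described above: it rises with the reliability of the separate scales and falls as the scales inter-correlate more strongly. Treat the formula as an assumption rather than as a quotation from Lienert and Raatz; all input values are invented.

```python
def profile_reliability(scale_reliabilities, scale_intercorrelations):
    """Assumed profile reliability: (mean r_tt - mean r_ij) / (1 - mean r_ij)."""
    mean_rel = sum(scale_reliabilities) / len(scale_reliabilities)
    mean_inter = sum(scale_intercorrelations) / len(scale_intercorrelations)
    if mean_inter >= 1.0:
        raise ValueError("mean inter-correlation must be below 1")
    return (mean_rel - mean_inter) / (1.0 - mean_inter)


# Invented values for three scales and their three pairwise correlations.
reliabilities = [0.70, 0.80, 0.90]
intercorrelations = [0.20, 0.30, 0.10]
value = profile_reliability(reliabilities, intercorrelations)
print(round(value, 2))
print(value >= 0.50)  # compare with the lower limit of .50 mentioned above
```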



4. RESULTS 

In Table 1 the mean scores and standard deviations for autonomy, 
competence, social relatedness, and interest are given for the total group and 
for the nine groups, separately for the five measurement points (one each 
session). Using the formula provided by Lienert and Raatz, profile reliability 
was calculated. It was more than sufficient for further use in the context of a 
course that was taught according to the principles of social constructivism: 
prof r_tt = .71. 

Table 1. Means and standard deviations for autonomy, competence, social relatedness, and 
interest by group and by measurement moment 

                            Autonomy      Competence          Social        Interest
                                                         relatedness
Group                      M      SD       M      SD       M      SD       M      SD

Moment 1 (orientation stage)
Bookies                 8.00    1.41    6.60    2.30    8.60    0.96    8.80    1.64
Celsius                 7.00    2.45    6.40    1.67    8.20    0.57    7.20    1.48
Esprit                  7.00    1.15    6.25    0.50    7.62    1.43    7.75    2.63
Groups                  6.50    0.84    7.17    0.98    8.08    0.66    4.67    1.63
Lot                     8.33    2.08    6.00    1.00    8.16    0.57    8.67    1.53
The crazy chicks        8.00    1.00    6.67    1.15    8.16    1.15   10.00    0.00
Manpower                7.33    2.42    7.00    1.63    8.07    1.90    8.57    1.40
The knife fighters      7.00    2.10    6.67    1.21    8.00    1.09    8.83    0.75
Skillis                 8.20    1.30    4.60    1.95    9.40    1.08   10.00    0.00
Total                   7.38    1.72    6.43    1.56    8.26    1.16    8.14    2.08

Moment 2 (execution stage)
Bookies                 8.40    2.19    5.60    1.67    8.70    0.75    8.80    0.84
Celsius                 7.20    2.68    7.20    2.28    7.20    0.44    8.00    1.41
Esprit                  8.33    0.82    6.67    1.63    7.91    1.46    8.67    1.51
Groups                  8.17    1.17    7.33    1.03    8.50    0.63    7.83    1.47
Lot                     9.75    0.50    5.50    1.91    8.12    1.10    6.25    2.63
The crazy chicks        6.33    1.53    6.33    1.53    7.50    1.32    7.00    2.65
Manpower                6.00    2.16    7.86    1.46    7.64    1.90    8.14    1.95
The knife fighters      6.20    1.30    6.80    2.17    8.30    1.25    8.80    0.45
Skillis                 7.80    1.30    6.40    1.14    9.30    0.75   10.00    0.00
Total                   7.54    1.92    6.74    1.69    8.14    1.25    8.26    1.72

Moment 3 (execution stage)
Bookies                 7.67    2.07    7.50    1.52    8.83    1.21    8.83    1.60
Celsius                 7.00    2.28    7.17    2.23    9.16    0.81    8.67    1.21
Esprit                  8.67    1.51    7.83    1.47    8.75    1.03    9.67    0.52
Groups                  7.50    1.76    5.33    2.73    7.60    1.19    4.00    2.10
Lot                     7.75    0.50    6.75    2.06    8.25    0.86    8.50    1.29
The crazy chicks        7.40    0.89    7.00    1.83    8.40    0.74    8.60    1.14
Manpower                8.29    1.98    7.86    1.07    7.85    1.43    8.29    1.11
The knife fighters      6.75    2.36    6.50    0.58    7.50    1.68    9.25    0.96
Skillis                 8.60    1.95    7.00    2.24    8.70    1.56    9.40    0.89
Total                   7.78    1.78    7.04    1.87    8.37    1.13    8.29    2.05

Moment 4 (execution stage)
Bookies                 7.67    2.07    7.50    1.52    8.83    1.21    8.83    1.60
Celsius                 6.50    2.35    6.60    1.52    7.66    1.12    7.67    1.86
Esprit                  8.67    2.16    6.67    1.63    8.60    1.38    8.83    1.60
Groups                  7.17    2.14    8.67    1.97    7.75    0.98    7.00    1.90
Lot                     8.SS    2.07    7.00    1.79    8.08    1.62    9.67    0.52
The crazy chicks        8.40    1.52    7.60    1.14    8.30    0.44    8.40    1.34
Manpower                6.17    1.60    7.67    1.03    6.25    1.57    7.33    1.97
The knife fighters      6.20    1.48    5.80    1.48    8.40    0.74    7.60    1.82
Skillis                 9.25    1.50    6.25    1.71    8.16    0.28    8.25    2.87
Total                   7.54    2.06    7.14    1.65    7.96    1.31    8.18    1.81

Moment 5 (wrapping up stage)
Bookies                 8.00    1.58    8.20    1.30    9.40    0.65    9.40    0.89
Celsius                 7.00    2.00    6.67    1.97    6.08    2.81    7.17    2.79
Esprit                  9.00    1.00    7.60    0.55    8.60    1.38   10.00    0.00
Groups                     -       -       -       -       -       -       -       -
Lot                        -       -       -       -       -       -       -       -
The crazy chicks        8.33    1.21    8.00    1.10    8.41    0.73    8.17    2.14
Manpower                7.50    2.07    8.50    1.38    5.75    1.94    8.50    1.22
The knife fighters         -       -       -       -       -       -       -       -
Skillis                 9.25    1.50    7.00    0.82    9.37    0.47   10.00    0.00
Total                   8.09    1.69    7.69    1.38    7.78    2.14    8.75    1.85


The internal structure of the questionnaire was examined by confirmatory 
factor analysis on the competence, autonomy and relatedness items, 
separately for the five measurement points. We presumed that the items had 
substantial loadings only on the intended factors. The matrix with 
intercorrelations between the latent factors was set free because the 
constructs are not presumed to act independently of each other. The 
confirmatory factor analyses with Maximum Likelihood estimations yielded 
good indices of fit, with overall (GFI), incremental (IFI), and comparative 
(CFI) goodness-of-fit measures ranging between .91 and .99. The χ²/df 
ratios varied between 0.74 and 2.05, which according to Byrne (1989) is 
evidence of a good fit between the observed data and the model. All items 
had significant, high loadings on the intended factors, stressing the internal 
validity of the items involved. With respect to the construct validity of 
QWIGI, it is concluded that although a very restricted number of items was 
used to measure competence, autonomy, social relatedness (including the 
degree of responsibility for learning), and interest, the intended factors were 
unequivocally retrieved. 
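
For readers less familiar with these fit indices, the sketch below shows how the χ²/df ratio, the comparative fit index (CFI) and the incremental fit index (IFI) follow from the chi-square values of the fitted model and of a baseline (independence) model. The formulas are the standard textbook definitions; the chi-square values and degrees of freedom are invented for illustration and are not the LISREL output of this study.

```python
def fit_indices(chi2_model, df_model, chi2_baseline, df_baseline):
    """Return chi-square/df, CFI and IFI for a fitted model and its baseline."""
    ratio = chi2_model / df_model
    # Non-centrality estimates, truncated at zero as in the usual CFI definition.
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    cfi = 1.0 - d_model / d_baseline if d_baseline > 0 else 1.0
    ifi = (chi2_baseline - chi2_model) / (chi2_baseline - df_model)
    return {"chi2_df": ratio, "cfi": cfi, "ifi": ifi}


# Invented example values for a model and its independence baseline.
indices = fit_indices(chi2_model=9.0, df_model=6,
                      chi2_baseline=150.0, df_baseline=15)
for name, value in indices.items():
    print(name, round(value, 3))
```

With these invented inputs the ratio is 1.5 and both incremental indices are close to .98, values of the kind reported above.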

The correlations between the three basic psychological needs (i.e., 
competence, autonomy, and social relatedness), and between these needs and 
personal interest are printed in Table 2, separately for each measurement 
point. 



Table 2. Pearson correlations between autonomy, competence, social relatedness, and 
interest by measurement moment (N = 32 to 50) 

                          Autonomy      Competence    Social relatedness

Moment 1 (orientation stage)
  Competence               .39**
  Social relatedness       .46**         .15
  Interest                 .57**        -.03           .33**

Moment 2 (execution stage)
  Competence               .21
  Social relatedness       .38**         .12
  Interest                 .16           .39**         .41**

Moment 3 (execution stage)
  Competence               .25*
  Social relatedness       .13           .32*
  Interest                 .27*          .38**         .39**

Moment 4 (execution stage)
  Competence               .06
  Social relatedness       .41**         .18
  Interest                 .41**         .10           .41**

Moment 5 (wrapping up stage)
  Competence               .41**
  Social relatedness       .57**         .22
  Interest                 .61**         .37**         .56**

*p < .05; **p < .01 



Close inspection of the patterns of correlations revealed that there are 
basically three stages in the project, namely a short orientation stage 
(one session), an intermediate execution stage that spans two to three 
weeks, and a wrapping-up stage that involves the last session and 
probably, for some groups, the penultimate session. Examination of these 
correlational data reveals that, as predicted, autonomy is associated with 
both competence and social relatedness, as well as with interest, in the 
orientation and wrapping-up stages. The correlations between competence 
and social relatedness are low to modest in all stages of the project; the 
highest value occurs at measurement point 3 (.32). It is noteworthy that in 
the last session of the execution stage, autonomy and competence are not 
associated (.06). This implies that, at this point, students’ sense of autonomy 
was unrelated to whether they scored high or low on perceived competence. 
At the same measurement point, autonomy has a moderate association (.41) 
with social relatedness, meaning that students who reported higher social 
relatedness tended to report a stronger sense of autonomy. 

A series of multiple regression analyses was conducted to examine how 
much variance in personal interest could be explained by the set of three 
predictors at each measurement point. 



Table 3. Common and unique effects of autonomy (A), competence (C), and social 
relatedness (SR) on interest per measurement moment 

Measurement moment              Predictor(s)   Zero-order    Semi-partial   R
                                               correlation   correlation

Moment 1 (orientation stage)    A              .57**         .54**
                                C              -.026         -.27*
                                SR             .33**         .06
                                A+C+SR                                      .64**

Moment 2 (execution stage)      A              .16           -.06
                                C              .39**         .35**
                                SR             .41**         .37**
                                A+C+SR                                      .54**

Moment 3 (execution stage)      A              .27*          .17
                                C              .38**         .22
                                SR             .39**         .27*
                                A+C+SR                                      .50**

Moment 4 (execution stage)      A              .41**         .27*
                                C              .10           .03
                                SR             .41**         .26*
                                A+C+SR                                      .49**

Moment 5 (wrapping-up stage)    A              .61**         .27*
                                C              .37**         .14
                                SR             .56**         .26*
                                A+C+SR                                      .67**

*p < .05; **p < .01 



Table 3 shows the amount of variance explained in interest, the multiple 
correlations between the joint psychological needs and interest, and the 
unique effects of the psychological needs on interest. Most variance was 
explained in the wrapping-up stage (45%), followed by the orientation stage 
(40%). The amount of variance explained in the execution stage decreased 
over time. Interestingly, students’ feeling of competence did not contribute 
unique variance to personal interest, except in the orientation stage. 
Remarkably, being self-efficacious when starting on the group project 
affected personal interest negatively, but having a sense of autonomy 
contributed a large portion of unique variance to personal interest. Please 
note that social relatedness did not contribute unique variance to interest in 
the project in the orientation stage. This is easy to understand because the 
students did not yet know whether the other group members would feel 
committed to the project and whether this would result in satisfying 
relationships. In all successive stages, the semi-partial correlations for 
social relatedness reached significance. Autonomy is predictive of interest 
on the last three measurement points of working on the project, but not on 
measurement point 2. 
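
The quantities reported in Table 3 can be illustrated with a small sketch. It assumes that the multiple correlation and the semi-partial correlations were obtained in the standard way from the correlation matrix, which the text does not state explicitly; the function names are hypothetical. The input values are the Moment 1 predictor intercorrelations from Table 2 and the zero-order correlations with interest from Table 3.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(rxx, rxy, keep):
    """Variance in the criterion explained by the predictors listed in keep."""
    sub = [[rxx[i][j] for j in keep] for i in keep]
    r = [rxy[i] for i in keep]
    beta = solve(sub, r)  # standardised regression weights
    return sum(b * c for b, c in zip(beta, r))


def semipartial(rxx, rxy, k):
    """Signed square root of the unique variance contributed by predictor k."""
    everyone = list(range(len(rxy)))
    others = [i for i in everyone if i != k]
    unique = r_squared(rxx, rxy, everyone) - r_squared(rxx, rxy, others)
    sign_source = solve(rxx, rxy)[k]  # sign follows the regression weight
    return math.copysign(math.sqrt(max(unique, 0.0)), sign_source)


# Moment 1 (orientation stage); predictor order: A, C, SR.
RXX = [[1.00, 0.39, 0.46],
       [0.39, 1.00, 0.15],
       [0.46, 0.15, 1.00]]
RXY = [0.57, -0.026, 0.33]  # zero-order correlations with interest

full_r2 = r_squared(RXX, RXY, [0, 1, 2])
multiple_r = math.sqrt(full_r2)
semipartials = [semipartial(RXX, RXY, k) for k in range(3)]

print("R =", round(multiple_r, 2), " R squared =", round(full_r2, 2))
for name, value in zip(["A", "C", "SR"], semipartials):
    print(name, round(value, 2))
```

Under these assumptions the results come out close to the Moment 1 entries of Table 3, with small differences that are to be expected because the inputs are rounded to two decimals. The squared multiple correlation is the proportion of variance explained that is quoted in the text for the orientation stage.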

To examine whether the self-selected groups differed significantly on 
autonomy, competence, social relatedness, and interest, a MANOVA was 
run. This analysis was followed by a Games-Howell post hoc test to examine 
which groups differed significantly on the four dependent measures (see 
Table 4). This type of post hoc test was preferred because homogeneity of 
variances on the three psychological needs and on interest was not assumed 
across the different groups. Based on the multivariate tests, it is noteworthy 
that the groups differed most in the orientation stage. In this stage the groups 
differed mainly in interest: the post hoc tests indicate that the interest scores 
of five groups (i.e., “bookies”, “the crazy chicks”, “manpower”, “the 
knife fighters”, and “skillis”) significantly exceeded (“>”) those of one 
specific group (i.e., “group3”). During the execution stage, the multivariate significance 
drops from moment 2 to moment 4. In this stage, autonomy, social 
relatedness, and interest play a role at distinct moments during project work. 
In the wrapping-up stage, multivariate significance increased, indicating that 
group differences increased. The group effect on the psychological need, 
social relatedness, at the last two measurement points is remarkable. 

Table 4. MANOVA of group effect on autonomy (A), competence (C), social relatedness (SR), 
and interest (I) per measurement moment 



5. DISCUSSION 

In the introduction, we remarked that motivation factors, including affect 
and interest, are powerfully present in any learning situation. Three points 
were made. First, interest is a significant factor affecting the quality of 
performance and should therefore be considered when interpreting students’ 
outcomes and their self-assessment. Second, affect experienced during the 
learning situation impacts on self-assessment. Third, new forms of 
instruction may or may not provide the conditions that satisfy students’ basic 
psychological needs. Students can give valuable information on the factors 
underlying their interest in a domain or activity (satisfaction of their 
psychological needs) and this information can help the teacher coach the 
learning process. 

As argued previously, attempts to study students’ feelings of autonomy, 
competence, and social relatedness in close connection with their developing 
interest are rare. We reasoned that university students who are working on 
self-chosen projects for several weeks are aware of their feelings of 
autonomy, competence and social relatedness and can report this 
information. We also hypothesised that college students use this information 
when assessing their personal interest in a group project. The focus in this 
paper was on the construction of an instrument that assesses university 
students’ interest on-line during successive sessions of working on a group 
project. We predicted and found that feelings of competence, autonomy, and 
social relatedness fluctuate during the course of the group project and that 
satisfaction of these basic psychological needs has a strong impact on the 
personal interest students express in the project during the successive stages. 
Furthermore, our results suggest that expressed personal interest in a group 
project is, to a large extent, determined by the student’s need satisfaction. In 
other words, when students express low interest in a group learning project it 
is advisable that teachers take a closer look at the reasons why their 
psychological needs are not satisfied because their needs act as signposts on 
the way to the students’ developing personal interest. In Figures 1a and 1b 
we have visualised the relation between personal interest and the three 
underlying psychological need states for two groups, namely Celsius and 
Skillis. As can be seen in these figures, the curve depicting social relatedness 
is closely linked to expressed personal interest in both groups. In the Celsius 
group, the curves for need of autonomy and competence are intertwined and 
influence interest jointly. In the Skillis group, autonomy seems to affect 
interest separately from students’ need of competence. 

Figure 1. Relation of interest scores (I) to the three psychological need states (Autonomy, 
Competence, and Social relatedness) over time for group Celsius (a) and group Skillis (b) 

In line with Krapp’s (2002) reasoning, we define personal interest in a 
group project as a relational construct that describes the person-object 
relation as it is represented in the student’s mind. We suggest that students’ 
interest in a group project is a non-linear process that takes place in multiple 
overlapping contexts. At least two contexts can be discerned. The first 
context is idiosyncratic in the sense that it is based on the student’s 
motivational beliefs about the domain. These beliefs, which are the result of 
previous encounters with similar content, are fed into the current situation by 
activating long-term memory schemata. This idiosyncratic context should be 
differentiated from a more socially determined context that represents 
students’ current perceptions of group processes and relations. 

Which lessons can assessment researchers draw from this study? What 
does the study mean for ongoing research into new modes of assessment? 
We found that the students’ satisfaction of their psychological needs 
explained most variance in reported interest at the beginning and end of the 
project. It seems that students’ assessment of the conditions for learning in 
terms of their perception of autonomy, need for competence, and social 
relatedness is a good predictor of their interest. We do not want to run the 
risk of over-interpreting the data from a single study with a limited sample 
and many potential biases. Hence, we will not compare in detail the pattern 
of the psychological needs across the various data collection points. Suffice 
it to draw the reader’s attention to the semi-partial correlations recorded in 
the orientation and the wrapping up stage that are reported in Table 3. As can 
be seen from this table, satisfaction of the need of autonomy is very 
important for developing personal interest in the initial stages of the project. 
The need to satisfy social relatedness does not seem to contribute (in terms 
of unique variance) much to personal interest expressed in the project and 
the need to satisfy competence has an inverse relationship with interest at 
this stage. At the end of the project, the need to satisfy autonomy is less 
important than in the beginning. Social relatedness now contributes much 
more to personal interest expressed in the project and the need to satisfy 
competence now makes a modest unique contribution to interest. This finding 
suggests that the pattern of factors that underlie student interest in the 
project fluctuates during the project and that these factors influence each other. It is evident that 
beginners in a specific domain differ from experts in the psychological needs 
that they want to satisfy most urgently in order to express interest in an 
activity. It is highly likely that the pattern between the three basic 
psychological needs changes when students have discovered for themselves 
that learning with and from each other does increase their competence. 

Our position is that students assess the learning conditions when they are 
confronted with a new assignment and continue to assess these conditions 
during their actual performance on the assignment in terms of the 
satisfaction of their basic psychological needs. This assessment affects the 
way they reflect on their performance and their judgement of progress 
(self-assessment). In other words, the students’ perception of the learning 
conditions in terms of the satisfaction of their psychological needs is 
important for interest assessment (and interest development) as well as for 
skill assessment (and skill development). 

Our view is in accordance with Kulieke et al.'s (1990) proposal to 
redefine the assessment construct. These researchers suggested that several 
assessment dimensions should be considered, amongst others, the 
registration of the extent to which the dynamic learning process was assessed 
on-line. Researchers and teachers who are involved in collaborative research 
applauded this new assessment culture, for they are in need of tools that 
inform the students whether their investment in a course or assignment 
results in deeper understanding of the content. Our argument is that it is not 
only skill development that should be assessed (self-assessed, peer-assessed, 
and teacher assessed) but also the students' developing interest in skill 
development. Indeed, we believe that students will (continue to) invest 
resources in skill development, provided they realise that their personal 
investment leads to valued benefits, such as intrinsic motivation, career 
perspectives, and personal ownership of work. Ultimately, what gets 
measured gets managed. 

In the study reported here, the information that became available through 
the self-report data was not fed back to the students during the project. On 
the basis of the satisfactory results reported here we decided to transform the 
paper and pencil version of the QWIGI into a computer-based instrument. A 
digital version of the questionnaire allows us to visualise the waxing and 
waning of a student’s basic psychological needs as well as his or her 
assessment of developing interest in the group project. Allowing students to 
inspect the respective curves that depict various aspects of their self- 
assessment and inviting them to reflect on the reasons behind their self- 
assessment is a powerful way to confront them with their perception of the 
constraints and affordances of the learning environment. The computerised 
version of the questionnaire also allows students to inspect each group 
member’s curves and gain information on how their peers perceive the 
quality of the learning environment and how interested they are in the group 
project. We think that the QWIGI is particularly suited for students who are 
not yet familiar with group projects and for students who express low 
personal interest in learning from and with each other. The digital version of 
the instrument is currently used in vocational schools. Students enjoy having 
the opportunity to assess their personal interest and aspects of the learning 
environment, especially when these assessments are visualised on the screen 
in bright colours. An additional benefit of the digital version of the 
instrument is that the detailed information is also available to the students’ 
teachers. Information about students’ developing interest in a group project 
allows teachers to encourage groups of students to focus on those aspects of 
the learning episodes that are still problematic for them at that point in time. 
It also allows teachers to change the task demands, provide appropriate 
scaffolding, or change the group composition when appropriate. 



1. INTRODUCTION 

The historical evidence available on the long tradition of assessment 
programs and methods takes us back to Biblical accounts and to the early 
Chinese civil service exams of approximately 200 B.C. (Cizek, 2001). Since 
then, socio-political and educational theories and beliefs have had a 
determining impact on the definitions used to implement assessment 
programs and the use of the information derived from them. The 
determination of standards is a central aspect of this process. Over time, 
these standards have been defined in numerous ways, including the setting of 
arbitrary numbers for passing, the unquestioned establishment of criteria by 
a ruling board, the performance of individuals in relation to a reference 
group (not always the “same” group or a “fair” group to compare with), and 
many other criteria. More recently, changes in the understanding of social 
and educational phenomena have inspired a movement to make assessments 
more relevant and better adjusted to the educational goals and the personal 
advancement of those being tested. Simultaneously, the exponential increase 
of information and the higher levels of complexity demanded by 
contemporary life require the determination of complex levels of 
performance for many current assessment needs. The emergence of new 
techniques and modalities of assessment has made it possible to address 
some of these issues. These new methods have also introduced new 
challenges to maintaining the necessary rigor in the assessment process. 
Standard setting methodologies have provided a means to improve the 
implicit and explicit categorical decisions in testing methods, making these 
decisions more open, fair, informed, valid and defensible (Mehrens & Cizek, 
2001). There is also now the realisation that standards and the cut scores 
derived from them are not “found”, they are “constructed” (Jaeger, 1989). 
Standard setting methods are used to construct defensible cut scores. 



2. STANDARD SETTING METHODS: HISTORICAL 
OVERVIEW 

Standard setting methods have been used extensively since the early 
1970s as a response to the increased use of criterion-referenced and basic 
skills testing to establish desirable levels of proficiency. The Standards for 
Educational and Psychological Testing (AERA, APA, NCME, 1999) 
establish that when cut scores are based on direct judgement, the process 
should be designed so that the judges can bring their knowledge and 
experience to bear in determining such cut scores (Standard 4.21). During 
the standard setting process the judgements of these experts are carefully 
determined so that the consistency of the process and the subsequent 
establishment of standards are appropriate. 
Shepard (1980) admonishes that standard-setting procedures, particularly for 
certification purposes, should balance judgement and passing rates: 

At a minimum, standard-setting procedures should include a balancing of 
absolute judgements and direct attention to passing rates. All of the 
embarrassments of faulty standards that have ever been cited are 
attributable to ignoring one or the other of these two sources of 
information. (p. 463) 

Early references to what came to be known as criterion-referenced 
measurement can be found in John Flanagan’s chapter “Units, Scores, and 
Norms” (Educational Measurement, 1951). He distinguishes between 
information regarding test content and information regarding ranks in a 
specific group, both derived from test score information. 

Thus he clearly associated content-based score interpretations with the 
setting of achievement standards. There were no suggestions on how to set 
these standards, and even Ebel in 1965 (Ebel, 1965; 1972) gives no concrete 
advice on the setting of passing scores, and discourages doing so. 

By the time the second edition of Educational Measurement was 
published in 1971, standard-setting methodologies were being proposed for 
the wave of criterion-referenced measures of the time. The term “criterion- 
referenced” can be traced to Glaser & Klaus (1962) and Glaser (1963), 
although the underlying concepts (i.e., standards vs. norms, focus on test 
content) had already been articulated in the literature (Flanagan, 1951). The 
criterion-referenced testing practice that ensued gave a strong push to 
the growth of standard-setting methodologies. 

The most widely known and used multiple-choice standard setting 
method, the “Angoff method”, was initially described in a mere footnote to 
Angoff’s chapter “Scales, Norms and Equivalent Scores” in the second 
edition of Educational Measurement (Angoff, 1971). The footnote explained 
the “Angoff method” as a “systematic procedure for deciding on the 
minimum raw scores for passing and honours.” 

Angoff (1971) very concisely described a method for setting standards. 

... keeping the hypothetical "minimally acceptable person" in mind, one 
could go through the test item by item and decide whether such a person 
could answer correctly each item under consideration. If a score of one is 
given for each item answered correctly by the hypothetical person and a 
score of zero is given for each item answered incorrectly by that person, 
the sum of the item scores will equal the raw score earned by the 
"minimally acceptable person." (p. 514) 

To allow probabilities rather than only binary estimates of success or 
failure on each item, Angoff (1971) explained: 

A slight variation of this procedure is to ask each judge to state the 
probability that the "minimally acceptable person" would answer each 
item correctly. In effect, the judges would think of a number of 
minimally acceptable persons, instead of only one such person, and 
would estimate the proportion of minimally acceptable persons who 
would answer each item correctly. The sum of these probabilities, or 
proportions, would then represent the minimally acceptable score. 
(p. 515) 
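
Both variants described in the footnote reduce to summing item-level judgements. The following minimal sketch illustrates this; averaging the per-judge sums across a panel is a common practice that is assumed here rather than taken from Angoff's text, and the function name and the ratings are hypothetical.

```python
def angoff_cut_score(ratings):
    """Return the panel cut score on the raw score scale.

    ratings[j][i] is judge j's estimate for item i: either 0 or 1 (the
    binary variant) or the probability that a minimally acceptable
    person answers item i correctly (the probability variant).
    """
    # Each judge's sum is the raw score expected of the minimally
    # acceptable person according to that judge.
    judge_totals = [sum(judge) for judge in ratings]
    # Assumption: the panel value is the mean of the judges' totals.
    return sum(judge_totals) / len(judge_totals)


# Binary variant, one judge, four items.
binary_cut = angoff_cut_score([[1, 0, 1, 1]])

# Probability variant, two judges, three items.
probability_cut = angoff_cut_score([[0.5, 0.75, 1.0],
                                    [0.25, 0.75, 0.5]])

print(binary_cut, probability_cut)
```

In the probability variant the two judges' sums are 2.25 and 1.5, so the illustrative panel cut score lies between them.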

Since then, variations on the Angoff Method have been used widely, but 
most of the current standard setting methods have dealt mainly with 
multiple-choice tests (Angoff, 1971; Ebel, 1972; 
Hambleton & Novick, 1972; Millman, 1973; Plake, Melican & Mills, 1991; 
Plake & Impara, 1996; Plake, 1998; Plake, Impara & Irwin, 2000; Sireci & 
Biskin, 1992; Zieky, 2001). 

In his Historical Perspective on Standard Setting, Zieky (2001) 
identifies the recent challenges of standard setting as an attempt to address 
the additional complications of applying standards to constructed-response 
tests, performance tests and computerised adaptive tests. 



3. EXTENDED-RESPONSE STANDARD SETTING 
METHODS 

Many of the new modes of assessment, including so-called “authentic 
assessments”, address complex behaviours and performances that go beyond 
the usual multiple-choice tests. This is not to say that objective testing 
methods cannot be used for the assessment of these complex abilities and 
skills, but constructed response methods often present a practical 
alternative. Setting defensible, valid standards becomes even more 
relevant for the family of constructed response assessments, which includes 
extended-response instruments. 

Several methods to carry out standard settings on extended-response 
examinations have been used. Faggen (1994) and Zieky (2001) describe the 
following methods for constructed-response tests: 

1. the Benchmark Method, 

2. the Item-Level Pass/Fail Method, 

3. the Item-Level Passing Score Method, 

4. the Test-Level Pass/Fail Method, 

5. the Cluster Analysis Method, and 

6. the Generalised Examinee-Centred Method. 

3.1 Benchmark Method 

In this method judges study "benchmark papers" and scoring guides that 
serve to illustrate the performance expected at relevant levels of the score 
scale. Once this has been done, judges select papers at the lowest level that 
they consider acceptable. The judgements are shared and discussed among 
judges, and they repeat the process until relative convergence is reached. 
Obtained scores are averaged, and the score for the minimum acceptable 
paper is determined as the recommended passing score. 

3.2 Item-Level Pass/Fail Method 

In this method judges read each paper and classify it as “passing” or 
“failing” without having been exposed to the original grades. Then, they 
discuss the results obtained and collate the papers. Again, this is an iterative 
process in which judges can revise their ratings. The process yields estimates 
of the probability that papers at the various score levels are considered 
“passing” or “failing”. The recommended standard is the point at which the 
probability of classification to each group is .5 (assuming consequences of 
either misclassification are equal). 
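
As an illustration of how the .5 point might be located, the sketch below tallies the proportion of “passing” classifications at each score level and interpolates linearly between adjacent levels. The linear interpolation, the function names and the data are assumptions made for this example; the sources cited above may estimate the probabilities differently.

```python
from collections import defaultdict


def pass_proportions(judgements):
    """Tally (score, passed) pairs into the proportion passing per score level."""
    counts = defaultdict(lambda: [0, 0])  # score -> [passing votes, all votes]
    for score, passed in judgements:
        counts[score][0] += int(passed)
        counts[score][1] += 1
    levels = sorted(counts)
    return levels, [counts[s][0] / counts[s][1] for s in levels]


def half_point(levels, proportions):
    """Score at which the proportion judged passing first reaches .5."""
    if proportions[0] >= 0.5:
        return levels[0]
    points = list(zip(levels, proportions))
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 < 0.5 <= p1:
            # Linear interpolation between adjacent score levels (assumption).
            return x0 + (0.5 - p0) * (x1 - x0) / (p1 - p0)
    return None  # the proportion passing never reaches .5


# Hypothetical classifications: four judgements at each score level.
judgements = (
    [(2, False)] * 4
    + [(3, True)] + [(3, False)] * 3
    + [(4, True)] * 3 + [(4, False)]
    + [(5, True)] * 4
)
levels, proportions = pass_proportions(judgements)
recommended_standard = half_point(levels, proportions)
print(levels, proportions, recommended_standard)
```

With these data a quarter of the papers at score 3 and three quarters of the papers at score 4 are judged passing, so the interpolated standard falls midway between the two levels.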

3.3 Item-Level Passing Score Method 

In this method judges estimate the average score that would be obtained 
by a group of minimally competent examinees. They accomplish this after 
considering the scoring rules and scheme, and the descriptions of 
performance at each score level. The recommended standard is the average 
estimated score across the various judges. 

3.4 Test-Level Pass/Fail Method 

In the Test-Level Pass/Fail Method judgements are made based on the 
complete set of examinee responses to all the constructed-response 
questions. Of course, for a one-item test, as Zieky (2001) points out, this 
method is equivalent to the Item-Level Pass/Fail Method previously 
described. In addition, Faggen (1994) mentions a variant of this method that 
incorporates the procedure of making ratings of an item dependent on the 
judgement for the previous response considered. 

3.5 Cluster Analysis Method 

The cluster analysis of test scores approach (Sireci, Robin, & Patelis, 
1999), although useful in identifying examinees with similar scores or 
profiles of scores, still leaves several problems unsolved. One is the issue of 
identifying the clusters that belong to the proficiency groups demanded by 
the standard setting framework of the test. Another unsolved question is the 
choice of method to apply in using the clusters to set the cutscores. 

3.6 Generalised Examinee-Centred Method 

In the generalised examinee-centred method (Cohen, Kane, & Crooks, 
1999) all of the scores in an exam are used to set the cutscores, with 
members of the standard-setting panel rating each performance on a scale 
linked to the standards that need to be set. The method, as described by 
Cohen et al. (1999), requires that the participants “establish a functional 
relation... between the rating scale and the test score scale” (p. 347). Then, 
the points identified on the rating scale that provide the definition of the 
category borders, are converted onto the score scale. This process generates 
cutscores for each of the category borders. Although this method has some 
advantages, such as the use of all the scores, and the use of an integrated 
analysis to generate all the cutscores, it becomes questionable when the 
correlation between the ratings on the scale and the scores of the test is low 
(Zieky, 2001). 
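
The conversion step can be sketched as follows, assuming for illustration that the functional relation is a straight line fitted by ordinary least squares. Cohen et al. (1999) do not prescribe this particular form in the passage quoted above, and the ratings, scores, border points and names below are all hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus the Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy / math.sqrt(sxx * syy)


# Hypothetical panel ratings of six performances and their test scores.
ratings = [1, 2, 2, 3, 3, 4]
scores = [12, 18, 22, 28, 32, 38]

intercept, slope, correlation = fit_line(ratings, scores)

# Hypothetical category borders expressed on the rating scale.
borders = {"basic/proficient": 1.5, "proficient/advanced": 2.5}

# Convert each border from the rating scale onto the test score scale.
cutscores = {name: intercept + slope * point for name, point in borders.items()}

print(round(correlation, 2), cutscores)
```

Reporting the correlation alongside the cutscores reflects the caveat noted above: when the ratings and the test scores are only weakly related, the fitted relation, and therefore the cutscores, deserve little confidence.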

As Jaeger (1994) points out, even these extensions to constructed- 
response standard-setting methodologies share the theoretical assumption 
that the tests to which they are applied are unidimensional in nature, and that 
the items of each test contribute to a summative scale (p. 3). Of course, we 
know that many of the instruments that are used in performance assessment 
are of a very complex nature, and possess multidimensional structures that 
cannot be captured by a single score of examinee performance derived in 
the traditional ways. 



4. THE OPTIMISED EXTENDED-RESPONSE 
STANDARD SETTING METHOD 

In order to deal also with the multidimensional scales that can be found in 
extended response examinations, the Optimised Extended-Response Standard 
Setting method (OER) was developed (Schmitt, 1999). The OER standard 
setting method uses well defined rating scales to determine the different 
scoring points where judges will estimate minimum passing points for each 
scale. For example, if an extended response item is to be assigned a 
maximum of 6 points, each judge is asked to estimate how many examinees 
out of 100 (at each level of competence) would obtain a one, a two, a three, a 
four, a five, and a six (where the total number has to add to 100). Their 
ratings are then weighted by the rating scale and averaged across all possible 
points. The average across judges gives the minimum passing score for each 
level of competence, for the specific item. The average across all items gives 
the minimum passing score for each level of competence for the total test. In 
this way, even rating scales that differ by item can be used. In 
addition, multiple possible passing points or grades can also be estimated. 
This method thus provides flexibility based on well-defined rating scales. 
Once the rating scale is well defined, the judges are trained to evaluate 
minimum standards based on the corresponding rating scale. This ensures 
consistency between the way standards are set and the way the scoring rubric 
is applied. 
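
The arithmetic described above can be sketched for a single level of competence. This is an illustration under stated assumptions, not the published procedure: “weighted by the rating scale” is read here as multiplying each score point by the number of examinees placed at it and dividing by 100, and the function names and judge estimates are hypothetical.

```python
def judge_expected_score(distribution):
    """Weighted mean score implied by one judge's estimate for one item.

    distribution maps each score point to the number of examinees
    (out of 100) the judge expects to obtain that score.
    """
    total = sum(distribution.values())
    if total != 100:
        raise ValueError("estimates must add to 100 examinees")
    return sum(point * count for point, count in distribution.items()) / total


def item_passing_score(judge_distributions):
    """Average of the judges' expected scores for one item."""
    values = [judge_expected_score(d) for d in judge_distributions]
    return sum(values) / len(values)


def overall_passing_score(items):
    """Average of the item-level passing scores, as described in the text."""
    values = [item_passing_score(judges) for judges in items]
    return sum(values) / len(values)


# Item 1 uses a 6-point scale, item 2 a 4-point scale; two judges each.
item_1 = [
    {1: 0, 2: 10, 3: 40, 4: 40, 5: 10, 6: 0},
    {1: 10, 2: 20, 3: 40, 4: 20, 5: 10, 6: 0},
]
item_2 = [
    {1: 10, 2: 40, 3: 40, 4: 10},
    {1: 0, 2: 50, 3: 50, 4: 0},
]

item_1_score = item_passing_score(item_1)
item_2_score = item_passing_score(item_2)
total_score = overall_passing_score([item_1, item_2])
print(item_1_score, item_2_score, total_score)
```

Because each item is handled with its own scale, the two items in the example can carry different maximum scores, which is the flexibility the method is said to provide. Repeating the calculation with a separate set of estimates for each level of competence would yield multiple passing points.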

As with the Angoff Method, the OER standard setting method uses 
judgement of minimum proficiency. Because of this, the training of the 
judges and the standard setting process need to be carefully conducted. We 
propose the following procedures to meet minimum standards in setting cut 
scores with the OER standard setting method. 

4.1 The OER Standard Setting Method: Steps 



4.1.1 Selection of Judges 

The panel of judges selected should be large and provide the necessary 
diverse representation across variables such as: geography, culture, specific 
technical background, etc. Hambleton (2001) mentions that often 15 to 20 
panellists are used in typical USA state assessments. Several studies have 
addressed the relationship between number of raters and the reliability of the 
judgements (Maurer, Alexander, Callahan, Bailey, & Dambrot, 1991; 
Norcini, Shea, & Grosso, 1991; Hurtz & Hertz, 1999). Generalizability 
analyses have shown that panels of 10 to 15 judges usually 
produce phi-coefficients of .80 and above. In most practical 
situations a minimum of 12 judges has been found necessary to provide 
reliable outcomes in setting standards (Schmitt, 1999). 

These judges should be selected from experienced professionals who are 
cognisant of the population they will make estimates about. For example, in 
a National Nursing Program, nurses who have 5 to 10 years of 
experience and are currently teaching students at the education level and in 
the particular content area to be tested would be good candidates as judges 
for the standard setting session. If the test is to be administered nationally, 
the representation of the judges also needs to be national. Regional and/or 
personal idiosyncrasies in terms of content or standards should be avoided 
(this needs to be continually monitored during the standard setting process). 
The gender and ethnic representation of the judges should mirror, as much 
as possible, that of the profession or content area being tested. Invitations to 
potential judges should include a brief description of what a standard setting 
is, and should reassure them that the process will be carefully explained 
when they meet. 
This assures them that all information will be covered at the same time for 
all participants. 

4.1.2 Standard Setting Meeting 

All judges should meet together in one room. Under the current state-of- 
the-art conditions, having all judgements made at the same time, in the same 
setting, under standard conditions, and with opportunity to interact in a 
controlled environment, following exactly the same process, ensures a 
minimum degree of standardisation. Although Fitzpatrick (1989) warns of the 
potential problems with group dynamics, Kane (2001) points out that the 
substantial benefits of having the panellists consider their judgements 
together as a group far outweigh this risk. 
Recently, several programs have instituted innovative processes where 
judges set standards through web-based or other technology-based 
“meetings”. These approaches apply a new technology to the 
implementation of standard setting sessions. Although innovative, and 
possibly less costly in the short term, these distributed “cyber meetings” can 
produce results that are less reliable and consistent, calling further into 
question a process that is already judgmental in nature. Harvey and Way (1999) 
describe one such approach. It should be expected that many such 
applications will appear in the future, but the underlying issues regarding the 
method to be used in order to determine the necessary borders of the judged 
categories will remain the same, varying only in the nature of the 
implementation media. There is no doubt that in the future new advances 
might make it possible to carry out standard setting sessions at a distance, 
without loss of quality in the results. 

4.1.3 Explain the Standard Setting Process 

The judges should be given a description of what the standard setting is 
about, the particulars of the OER standard setting method, and the reasons 
why their participation is so critical in determining minimum passing scores, 
together with examples of how they will carry out the OER standard 
setting process. As an example, a computerised presentation can be 
developed to cover all major points of the process. This presentation should 
be basic and should not assume any prior knowledge of the standard setting 
process on the part of any participant. Questions should always be welcomed, 
and should be answered fully. 

4.1.4 Provide Test Content & Rating Scale(s) 

A well-defined table of specifications where the test content is clearly 
outlined needs to be provided in all situations. The rating scale for each 
extended response question should be provided and explained. This should 
not be the moment, though, for revisions or changes. Nevertheless, if a major 
flaw in the rating scale is identified, it must be corrected before the OER 
standard setting process starts. The clarity of the rating scale is 
paramount to the reliability of the scoring and of the OER standard setting 
process. Therefore, the scoring criteria for each item, and the rationale for 
them, must be clearly established before the start of the standard setting 
session. 

4.1.5 Define Competency Levels 

When determining minimum standards the judges need to understand the 
competency levels they will pass judgement on. In most 
licensure/certification programs, only one minimum competency level is 
evaluated. In these programs, the candidate either meets the minimum 
requirements and passes, or does not meet them and fails. These types of 
assessment programs require the judges to determine only one cut score. 
Other programs where more distinctions in proficiency levels are needed may 
have several cut scores. Examples of such programs are educational 
institutions that report exam results on multiple-grade scales (e.g., 
A-B-C-D-F) and assessment programs that report results on multiple 
performance levels (e.g., novice, apprentice, proficient, advanced). The 
following Conceptual Competency Graph provides a theoretical 
representation of different thresholds for a program with five competency 
levels. In this graph, each distribution shows the theoretical score ranges 
expected on a specific item by a group of examinees typical of each 
competency category. It is worth noting that the normal score distributions 
typically observed on tests derive from the conceptual application of the 
Central Limit Theorem to multiple tasks, for the population of examinees, 
across the full ability range. In this example, Highly Competent corresponds 
to a grade of A (top score) and Not Competent corresponds to a grade of F 
(failing score), for every score below the D cut score. In this scale, Marginal 
corresponds to a grade of C, which would indicate the examinee to be 
minimally competent but just passing. 
The conceptualisation represented in the graph clearly indicates that the 
thresholds to be used by the judges are not midpoints of the distributions of 
proficiencies of the students in each of the proficiency levels. Rather, they 
represent a homogeneous set of potential students at the absolute minimum 
level of proficiency that could be classified within each of the categories. 

These students with the minimum level of proficiency for each category 
are described as the “borderline” examinees. 

Figure 1. The Conceptual Competency Graph 

At the beginning of the standard setting session, descriptions of 
“borderline” test-takers at each of the competency levels must be developed 
by the group. A “borderline” test-taker is an examinee whose knowledge and 
skills are at the borderline or lowest level (threshold) of each competency 
level. These definitions of “borderline” are quite critical and need to be 
arrived at in complete agreement by consensus of the standard setting 
judges. This process helps the group to integrate, establish common 
baselines, and get to work as a group. Example definitions of Highly 
Competent, Competent, and Marginal are presented below: 

Table 1. Definitions of Mastery Levels 

DEFINITIONS OF THRESHOLD MASTERY LEVELS 

HIGHLY COMPETENT (minimal A) 

Effectively communicates a thorough analysis and synthesis of all major 
concepts and themes in the question. 

• Explains relationships among concepts. 
• Provides evidence drawn from the course materials and/or the literature 
  to support answer. 
• Where required by the question, chooses a position on an issue and 
  provides rational justification for that position. 
• Links theory to practice. 

COMPETENT (minimal B) 

Communicates effectively a fundamental analysis of most major 
concepts and themes in the question. 

• Defines relationships among concepts. 
• Provides some evidence drawn from the course to support the answer. 
• Where required by the question, chooses a position on an issue and 
  provides partial justification for that position. 
• Where required by the question, links theory to practice. 

MARGINAL (minimal C - Passing) 

Communicates a limited understanding of the major concept(s) and 
theme(s) in the question. 

• Identifies concepts and provides a superficial description of the 
  relationship(s). 
• Provides superficial/marginal evidence from the course to support 
  answer. 
• Where required by the question, chooses a position on an issue and 
  provides discussion with marginal justification. 
• Where required by the question, superficially links theory to practice. 

Table 2. Example Ratings of Item 1 Across Multiple Thresholds 

RATER # ______   DATE ______ 

CATEGORY: A - Highly Competent 

                  RATING POINTS 
ITEM       1     2     3     4     5     6 
1          0     0    10    15    50   [missing] 

CATEGORY: B - Competent 

                  RATING POINTS 
ITEM       1     2     3     4     5     6 
1          5    10    15    25    35    10 

CATEGORY: C - Marginal 

                  RATING POINTS 
ITEM       1     2     3     4     5     6 
1         10    20    40    20    10     0 



4.1.6 Estimate Item Difficulty 

Judges are instructed to estimate how 100 hypothetical students at the 
threshold of each of the competence categories [e.g.: Highly Competent (A), 
Competent (B), Marginal (C)] would be distributed across all possible rating 
points for an item. 

Example Instructions: 

“For each item estimate the percentage (or proportion) of 100 examinees 
in the specified category (A, B, or C) who would get each rating (1 to 6). 
Make sure that all 100 examinees are distributed across all possible 
points.” 

As indicated by the ratings in Table 2, the expectation is that at higher 
competency levels the 100 examinees would be distributed more heavily at 
the higher rating points of the scale. As the level of competency decreases, 
the distribution of the 100 examinees shifts first towards the middle and then 
towards the lower rating points of the scale. It has been found that it is 
important to check carefully that judges’ estimates add up to 100 for each 
given distribution (Schmitt, 1999). 
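
The check that each distribution adds up to 100 lends itself to a short script. The following is a minimal sketch in Python, using the Category B and Category C ratings for Item 1 from Table 2. The function names are hypothetical, and summarising a distribution by its mean rating is an assumption made here for illustration rather than a rule stated by the method.

```python
def validate_distribution(counts, total=100):
    """Raise ValueError unless the counts are non-negative and sum to `total`."""
    if any(c < 0 for c in counts):
        raise ValueError("estimates cannot be negative")
    if sum(counts) != total:
        raise ValueError(f"estimates sum to {sum(counts)}, expected {total}")


def mean_rating(counts):
    """Mean rating implied by a distribution over rating points 1..len(counts).

    Assumption: the mean is used as a one-number summary of the judge's
    distribution; the source does not prescribe this.
    """
    validate_distribution(counts)
    weighted = sum(point * n for point, n in enumerate(counts, start=1))
    return weighted / sum(counts)


# Item 1 ratings from Table 2 (rating points 1 to 6).
category_b = [5, 10, 15, 25, 35, 10]
category_c = [10, 20, 40, 20, 10, 0]

print(f"B: {mean_rating(category_b):.2f}")  # prints "B: 4.05"
print(f"C: {mean_rating(category_c):.2f}")  # prints "C: 3.00"
```

Under this assumption the borderline Competent examinee is expected to score about one rating point higher on the item than the borderline Marginal examinee.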

4.1.7 Reaching Informed Consensus 

Another important element in the OER standard setting process is 
reaching “informed consensus”. After each item is rated across 
thresholds, judges are asked to state their ratings aloud and 
to justify them. The process begins by asking judges for their individual 
ratings across all levels of competency. The most extreme estimates are noted 
and the judges holding those viewpoints are asked to explain the 
reasons for their extreme ratings. The group is encouraged not to pre-judge 
the reasonableness of the estimates and to keep an open mind towards the 
diverging explanations. After these points have been expounded, all judges 
are given a chance to revise their estimates in light of the explanations 
given. In this way, “informed consensus” is achieved, and overall 
variability between ratings is minimised. If judges choose to remain 
discrepant, the different viewpoints are respected. 
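
The bookkeeping for one round of this discussion can be sketched as follows. This is only an illustration: the judge labels and the first- and second-round estimates are invented for the example, and the function names are hypothetical.

```python
def extreme_judges(estimates):
    """Return (lowest, highest) judge labels for one item at one threshold."""
    lowest = min(estimates, key=estimates.get)
    highest = max(estimates, key=estimates.get)
    return lowest, highest


def rating_range(estimates):
    """Spread between the highest and lowest estimate."""
    return max(estimates.values()) - min(estimates.values())


# Hypothetical estimates (in rating points) before and after discussion.
first_round = {"J1": 4.1, "J2": 3.9, "J3": 5.2, "J4": 2.8, "J5": 4.0}
second_round = {"J1": 4.1, "J2": 4.0, "J3": 4.6, "J4": 3.5, "J5": 4.0}

low, high = extreme_judges(first_round)
print(f"Ask {low} and {high} to explain their ratings")
print(rating_range(second_round) < rating_range(first_round))
```

Judges who keep their original estimate simply appear unchanged in the second round, which matches the principle that remaining discrepancies are respected.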

4.1.8 Maintain Parallel Standards 

To maintain parallel standards across thresholds and items it is important 
to ascertain the reasonableness of the distributions for the same item across 
different thresholds and for different items within a threshold. As an 
important check of this method, this “reasonableness test” should apply 
across thresholds for the same item, across items for each threshold (in both 
cases for the same judge), and across judges. The check is carried out by 
having the judges evaluate their ratings across different thresholds and 
items. The panel facilitator addresses discrepancies within each judge’s 
ratings, as well as those observed between members of the panel. The process 
involves active interaction between all participants, examining the 
justifications for each significantly discrepant score. The end result of the 
process is a new consensus which might or might not eliminate the 
discrepancies, but which has examined, and provided a basis for, any 
remaining differences or any adjustments carried out. The remaining 
differences are where the multidimensionality of the scale can be correctly 
represented and adjusted for, making it possible for a final score to 
represent dimensional differences across items and thresholds. 
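
One simple way to automate part of this check is to confirm that, for a given judge and item, the distribution at a higher threshold does not imply a lower rating than the distribution at a lower threshold. The sketch below uses the Category B and C ratings for Item 1 from Table 2. Using the mean rating as the comparison statistic is an assumption for illustration, and the function names are hypothetical.

```python
def mean_rating(counts):
    """Mean rating implied by a distribution over rating points 1..len(counts)."""
    weighted = sum(point * n for point, n in enumerate(counts, start=1))
    return weighted / sum(counts)


def thresholds_in_order(by_threshold, order):
    """True if mean ratings do not increase when moving down through `order`.

    `order` lists threshold labels from the highest competency level to the
    lowest, for example ("B", "C").
    """
    means = [mean_rating(by_threshold[name]) for name in order]
    return all(higher >= lower for higher, lower in zip(means, means[1:]))


# One judge's ratings of Item 1 at two thresholds (from Table 2).
item_1 = {
    "B": [5, 10, 15, 25, 35, 10],
    "C": [10, 20, 40, 20, 10, 0],
}

print(thresholds_in_order(item_1, order=("B", "C")))
```

A failed check would not by itself show that a rating is wrong; it only flags a pattern that the facilitator may wish to raise with the judge.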

4.1.9 Setting Cut Scores for the Item and Test 

The probability distributions resulting from this procedure are used to set 
the cut scores for each examination by averaging the judges’ ratings across 
items at each competency level. An example of a summary page identifying 
cut scores for a test for three different thresholds is presented in Table 3. The 
average cut score point for each of four items is given in the far right column 
and the overall cut score is presented as either a cut score or percent correct 
across all four items and across all eight judges. In cases where discrepancies 
in standards across judges are determined to be too large, the ratings of the 
outlier judge can be deleted and the averages computed again with the 
remaining data. 
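
The averaging can be illustrated with the Category B (Competent) ratings from Table 3. The sketch below is a minimal Python illustration. It assumes that scores are rounded to the nearest whole point with halves rounded up, and that percentages are taken relative to the maximum of 6 points per item; both assumptions reproduce the figures printed in Table 3 but are not stated in the text. The function names are hypothetical.

```python
import math

MAX_POINTS = 6  # top of the 1-6 rating scale

# Category B ratings from Table 3: rows are items 1-4, columns are judges #1-#8.
ratings_b = [
    [4, 5, 5, 5, 5, 5, 5, 5],
    [3, 5, 4, 4, 4, 4, 4, 4],
    [4, 4, 5, 3, 5, 4, 3, 3],
    [4, 3, 3, 4, 4, 3, 5, 5],
]


def round_half_up(value):
    """Round to the nearest integer, with .5 rounded up (assumed convention)."""
    return math.floor(value + 0.5)


def item_cut_scores(ratings):
    """Average each item's ratings across judges (the ALL column)."""
    return [round_half_up(sum(row) / len(row)) for row in ratings]


def judge_percents(ratings, max_points=MAX_POINTS):
    """Each judge's total across items as a percentage of the maximum."""
    n_items = len(ratings)
    totals = [sum(column) for column in zip(*ratings)]
    return [round_half_up(100 * t / (n_items * max_points)) for t in totals]


def overall_percent(ratings, max_points=MAX_POINTS):
    """Percentage across all items and all judges."""
    total = sum(sum(row) for row in ratings)
    cells = sum(len(row) for row in ratings)
    return round_half_up(100 * total / (cells * max_points))


def drop_judge(ratings, judge_index):
    """Return the ratings with one (outlier) judge's column removed."""
    return [row[:judge_index] + row[judge_index + 1:] for row in ratings]


print(item_cut_scores(ratings_b))   # prints [5, 4, 4, 4]
print(judge_percents(ratings_b))    # prints [63, 71, 71, 67, 75, 67, 71, 71]
print(overall_percent(ratings_b))   # prints 69
```

Removing a judge with `drop_judge` and calling the same functions again corresponds to recomputing the averages with the remaining data.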

Table 3. Standard Setting Score by Category (A, B, C) 

CATEGORY: A - Highly Competent 

                                  RATERS                         AVERAGE 
ITEM #             #1    #2    #3    #4    #5    #6    #7    #8    ALL 
1                   5     6     6     6     6     6     6     6     6 
2                   5     6     5     5     5     5     5     5     5 
3                   6     5     6     4     6     5     4     4     5 
4                   4     4     4     5     5     4     6     6     5 
Passing Score       5     5     5     5     6     5     5     5     5 
Passing Percent    83%   88%   88%   83%   92%   83%   88%   88%   86% 

CATEGORY: B - Competent 

                                 PANELISTS                       AVERAGE 
ITEM #             #1    #2    #3    #4    #5    #6    #7    #8    ALL 
1                   4     5     5     5     5     5     5     5     5 
2                   3     5     4     4     4     4     4     4     4 
3                   4     4     5     3     5     4     3     3     4 
4                   4     3     3     4     4     3     5     5     4 
Passing Score       4     4     4     4     5     4     4     4     4 
Passing Percent    63%   71%   71%   67%   75%   67%   71%   71%   69% 

CATEGORY: C - Marginal 

                                 PANELISTS                       AVERAGE 
ITEM #             #1    #2    #3    #4    #5    #6    #7    #8    ALL 
1                   3     4     4     4     4     4     4     4     4 
2                   2     4     3     3     3     3     3     3     3 
3                   3     3     4     2     4     3     2     2     3 
4                   3     2     2     3     3     2     4     4     3 
Passing Score       3     3     3     3     4     3     3     3     3 
Passing Percent    46%   54%   54%   50%   58%   50%   54%   54%   53% 



4.2 Application of the Optimised Extended Response 
Standard Setting Method 

The OER Standard Setting Method has been successfully applied to 
higher education examinations, at the US graduate and undergraduate levels, 
with constructed-response items that involved extended writing in response 
to several prompts (Schmitt, 1999). These examinations had several items, 
many of them with different scales. The OER Standard Setting Method 
proved easy to explain to panellists/judges, and flexible and accurate when 
used with the different rating scales. Inter-judge reliability estimates 
(phi-coefficient) ranged between .80 and .94. Raymond and Reid (2001) 
consider coefficients of .80 and greater desirable. 

4.2.1 Outcome of Examinations 

The exams used as examples in this study were used to grant three 
semester hours of upper-level undergraduate credit to students who receive a 
score equivalent to a letter-grade of C or higher on the examination. An 
example of the table used to record the OER Standard Setting Method across 
thresholds is presented in Table 4. This Table could be used as a model for 
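an implementation of the OER Standard Setting Method. 

Table 4. Example of a form to record the OER Standard Setting across thresholds 

RATER # ______   DATE ______ 

CATEGORY: A - Highly Competent 

                       RATINGS 
ITEM       1      2      3      4      5      6 
1         ___    ___    ___    ___    ___    ___ 
2         ___    ___    ___    ___    ___    ___ 
3         ___    ___    ___    ___    ___    ___ 
4         ___    ___    ___    ___    ___    ___ 
5         ___    ___    ___    ___    ___    ___ 

CATEGORY: B - Competent 

                       RATINGS 
ITEM       1      2      3      4      5      6 
1         ___    ___    ___    ___    ___    ___ 
2         ___    ___    ___    ___    ___    ___ 
3         ___    ___    ___    ___    ___    ___ 
4         ___    ___    ___    ___    ___    ___ 
5         ___    ___    ___    ___    ___    ___ 

CATEGORY: C - Marginal 

                       RATINGS 
ITEM       1      2      3      4      5      6 
1         ___    ___    ___    ___    ___    ___ 
2         ___    ___    ___    ___    ___    ___ 
3         ___    ___    ___    ___    ___    ___ 
4         ___    ___    ___    ___    ___    ___ 
5         ___    ___    ___    ___    ___    ___ 

For each item estimate the percentage (or proportion) of 100 examinees 
in the specified category (A, B, or C) who would get each rating. 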



5. DISCUSSION 

The OER Standard Setting Method discussed here presents a valuable option 
for dealing with the multidimensional scales that are found in extended 
response examinations. As originally developed (Schmitt, 1999), this 
method’s well-defined rating scales provide a reliable procedure for 
determining the scoring points at which judges estimate minimum 
passing points for each scale. The procedure has been shown to work in 
various examinations in different content areas, in traditional 
paper-and-pencil administrations as well as in computer-delivered 
examinations (Cascallar, 2000). As such it addresses the challenge presented 
by the many complexities in the application of standards to 
constructed-response tests in a variety of settings and forms of 
administration. 

Recent conceptualisations, such as those differentiating between 
criterion- and construct-referenced assessments (William, 1997), present 
very interesting distinctions between the descriptions of levels and the 
domains. This method can accommodate that conceptualisation, providing both 
an adequate “description” of the levels, as attained by the consensus of the 
judges, and a flexible “exemplification” of the level inherent in the 
process of reaching the consensus. As has been pointed out, there is an 
essential need to estimate the procedural validity (Hambleton, Jaeger, Plake, 
& Mills, 2000; Kane, 1994) of judgement-based cutoff scores. This line of 
research will eventually lead to the most desirable techniques to guide the 
judges in providing their estimates of probability. In this endeavour the OER 
Standard Setting Method suggests a methodology and provides the 
procedures to maintain the degree of consistency needed to make critical 
decisions that affect examinees in the different settings in which their 
performance is measured against cut scores set using standard setting 
procedures. With reliability being a necessary but not sufficient condition for 
validity, it is necessary to investigate and establish valid methods for the 
setting of those cutoff points (Plake & Impara, 1996). 

The general uneasiness with current standard setting methods 
(Pellegrino, Jones, & Mitchell, 1999) rests to a great extent on the fact that 
setting standards is a judgement process that needs well-defined procedures, 
well-prepared judges, and the corresponding validity evidence. Such validity 
evidence is essential to reach the quality commensurate with the importance 
of its application in many settings (Hambleton, 2001). Ultimately, the setting 
of standards is a question of values and of the decision-making involved in 
evaluating the relative weight of the two types of classification error 
(Zieky, 2001). As there are no purely absolute standards, and 
problems have been identified with various existing methods (Hambleton, 
2001), it is imperative to remember and heed the often-cited words of Ebel 
(1972), which are still current today: 

Anyone who expects to discover the "real" passing score by any of these 
approaches, or any other approach, is doomed to disappointment, for a 
"real" passing score does not exist to be discovered. All any examining 
authority that must set passing scores can hope for, and all any of their 
examinees can ask, is that the basis for defining the passing score be 
defined clearly, and that the definition be as rational as possible. (p. 496) 

It is expected that the OER Standard Setting Method will provide a better 
way to determine passing scores for extended response examinations where 
multidimensionality could be an issue, and in this way provide a framework 
that more accurately captures the elements leading to quality standard setting 
processes and, ultimately, to more reliable, fairer, and more valid evaluation 
of knowledge. 

1. INTRODUCTION 

Formal assessment has always been a critical component of the 
educational process, determining entrance, promotion and graduation. In 
many countries (e.g. United States, England & Wales, Australia) assessment 
has assumed an even more salient place in governmental policy over the last 
two decades. Attention has focused on assessment because of its role in 
monitoring system functioning at different levels (students, teachers, schools 
and districts) for purposes of accountability and, potentially, spearheading 
school improvement efforts. At the same time, new information technologies 
have exploded on the world scene with enormous impact on all sectors of the 
economy, education not excepted. 

In the case of pre-college education, however, change (technological and 
otherwise) has come mostly at the margins. The core functions in most 
educational systems have not been much affected. The reasons include the 
pace at which technology has been introduced into schools, the organisation 
of technology resources (e.g. computer labs), poor technical support and lack 
of appropriate professional development for teachers. Accordingly, schools 
are more likely to introduce applications courses (e.g. word processing, 
spreadsheets) or “drill-and-kill” activities rather than finding imaginative 
ways of incorporating technology into classroom practice. While there are 
certainly many fine examples of using technology to enhance motivation and 
improve learning, very few have been scaled up to an appreciable degree. 

Notwithstanding the above, it is likely that the convergence of computers, 
multimedia and powerful communication networks will eventually leave 
their mark on the world of education, and on assessment in particular. 
Certainly, technology has the potential to increase the value of and enhance 
access to assessment. It can also improve the efficiency of assessment 
processes. In this chapter, we present and explicate a structure for exploring 
the relationship between assessment and technology. While the structure 
should apply quite generally, we confine our discussion primarily to U.S. 
pre-college education. 

In our analysis, we distinguish between direct and indirect effects of 
technology. By the former we refer to the tools and affordances that change 
the practice of assessment and are the principal focus of attention in the 
research literature. Excellent examples are provided by Bennett (1998, 
2001) and Bunderson, Inouye, and Olsen (1989). In these studies 
the authors project how the exponential increase in available computing 
power and the advent of affordable high speed data networks will affect the 
design and delivery of tests, lead to novel features and, ultimately, to 
powerful new assessment systems that are more tightly coupled to 
instruction. 

There is noticeably less attention to what we might term the indirect 
effects of technology; that is, how technology helps to shape the political- 
economic context and market environment in which decisions about 
assessment take place. These decisions, concerning priorities and resource 
allocation, exert considerable influence on the evolution of assessment. 
Indeed, one can argue that while science and technology give rise to an 
infinite variety of possible assessment futures, it is the forces at play in the 
larger environment that determine which of these futures is actually realised. 
For this reason, it is important for educators to appreciate the different ways 
in which technology can and will influence assessment. With a deeper 
understanding of this relationship, they will be better prepared to help 
society harness technology in ways that are educationally productive. 
Bennett (2002) offers a closely related analysis of the relationship between 
assessment and technology, with a detailed discussion of current 
developments in the U.S. along with informed speculation about the future. 

Section 2 begins by presenting a framework for the study of assessment. 
In Section 3 we explore the direct effects of technology, followed in Section 
4 by an analysis of how technology can contribute to assessment quality. 
Section 5 discusses the indirect effects of technology and Section 6 the 
relationship between assessment purpose and technology. The final Section 
7 offers some conclusions. 

2. A FRAMEWORK FOR ANALYSIS 

Braun (2000) proposed a framework to facilitate the analysis of forces 
like technology, which shape the practice of assessment. The framework 
comprises three dimensions: Context, Purpose and Assets. (See Figure 1.) 




Figure 1. Three dimensions of assessment 

Context refers to: (i) the physical, cultural or virtual environment in 
which assessment takes place; (ii) the providers and consumers of 
assessment services; and (iii) any relevant political and economic 
considerations. (See Figure 2.) For example, the microenvironment may 
range from a typical fourth grade classroom in a traditional public school to 
the bedroom of a home-schooled high school student taking an online 
physics course. 

Figure 2. The context dimension of assessment 

In the former case, the providers range from the teacher administering an 
in-class test to the publisher of a standardised end-of-year assessment to be 
used for purposes of accountability. For the in-class test, the student and the 
teacher are the primary consumers, while for the end-of-year assessment the 
primary consumers are the school and government officials as well as the 
public at large. In the latter case, the provider is likely to be some 
combination of a for-profit company and the school faculty while the 
primary consumer is the student and her parent. 

The macroenvironment is largely characterised by political and economic 
considerations. The rewards and sanctions (if any) attached to the end-of- 
year assessment, along with the funding allocated to it, will shape the 
assessment program and its impact on classroom activities. In the online 
course, institutional interest in both reducing student attrition and 
establishing the credibility of the program will influence the nature of the 
assessments employed. 

The second dimension, purpose, also has three aspects: Choose, Learn 
and Qualify. (See Figure 3.) The data from an assessment can be used to 
choose a program of study or a particular course within a program. Other 
assessments serve learning by providing information that can be used by the 
student to track progress or diagnose strengths and weaknesses. Finally, 
assessments can determine whether the student obtains a certificate or other 
qualification that enables them to attain their goals. Although these purposes 
are quite distinct, a single assessment may well serve multiple purposes. For 
example, results from a selection test can sometimes be used to guide 
instruction, while a portfolio of student work culled from assessments 
conducted during a course of study can inform a decision about whether the 
student should receive a passing grade or a certificate of completion. 




Figure 3. The purpose dimension of assessment 

In classroom settings, external tests (alone or in conjunction with 
classroom performance) are sometimes used for tracking purposes. Teachers 
will employ informal assessments during the course of the year to inform 
instruction and may subscribe to services offered by commercial firms in 
order to enable their students to practice for the end-of-year assessment, 
which relates to the “Qualify” aspect of purpose. Typically, governmental 
initiatives in assessment focus on this third aspect with the aim of 
establishing a basis for accountability. 

In the case of online learning, students may employ preliminary 
assessments to decide if they are ready to enter the program. Later they will 
use assessments to monitor their progress in different classes and 
subsequently sit for course exams to determine whether they have passed or 
failed. 

The third dimension, assets, represents what developers bring to bear on 
the design, development and implementation of an assessment. It consists of 
three components: Disciplinary knowledge, cognitive/measurement science 
and infrastructure. (See Figure 4.) The first refers to the subject matter 
knowledge (declarative, procedural and conceptual) that is the focus of 
instruction. The second component refers to the understandings, models and 
methods of both cognitive science and measurement science that are relevant 
to test construction and test analysis. Finally, infrastructure comprises the 
systems of production and use that support the assessment program. These 
systems include the hardware, software, tools and databases that are needed 
to carry out the work of the program. 

In the following sections we will examine the effects of technology on 
assessment. We will be concerned with not only whether a particular set of 
technologies has or can have an impact on assessment practice but also its 
contribution to assessment quality. How is quality defined? We posit that 
assessment quality has two essential aspects, denoted by validity and 
efficiency. 

The term validity encompasses psychometric characteristics (i.e. 
accuracy and reliability), construct validity (i.e. whether the test actually 
measures what it purports to measure) and systemic validity (i.e. its effect on 
the educational system). Efficiency refers to the monetary cost and time 
involved in production, administration and reporting. While cost and time 
are usually closely linked, they are sometimes distinct enough to warrant 
separate consideration. As we shall see below, validity and efficiency often 
represent countervailing considerations in assessment design, with the 
balance point determined by both context and purpose. 

Figure 4. The assets dimension of assessment 

Indeed, quality is a multidimensional construct, which can legitimately 
be viewed differently by different stakeholders. For example, governmental 
decision makers often give primacy to the demands of efficiency over 
considerations of validity, particularly when the “more valid” solutions are 
not readily available or require longer time horizons. Educators, on the other 
hand, typically focus on validity concerns although they are certainly not 
indifferent to issues of cost and time. These conflicting views play out over 
time in many different settings. 



3. THE DIRECT EFFECTS OF TECHNOLOGY 

In view of the definition of the direct effects of technology offered in the 
introduction, they are best studied by consideration of the infrastructure 
component of the Assets dimension. To appreciate the impact of technology, 
we require, at least at a schematic level, a model of the process for 
assessment design and implementation. This is presented in Figure 5. 
Somewhat different and more elaborated models can be found in Bachman 
and Palmer (1996) and Mislevy, Steinberg, and Almond (2002). 




Figure 5. A model of the process for assessment design and implementation. 

The first phase of the process leads to the identification of the design 
space; that is, the set of feasible assessment designs. The design space is 
determined by the “three C’s”: Constructs, Claims and Constraints. 
Constructs are the targets of inference drawn from the substantive field in 
question, while claims are the operational specifications of those inference 
targets expressed in terms of what the students know or can do. Constraints 
are the limitations (physical, monetary, temporal) that must be taken into 
account by any acceptable design. 

Once the boundaries of the design space are known, different designs can 
be generated, examined and revised through a number of cycles, until a final 
design is obtained. (At this stage, operational issues can also be addressed.) 
With a putative final design in hand, construction of the instrument can 
begin. In practice, this work is usually embedded in a larger systems 
development effort that supports subsequent activities including test 
administration, analysis and reporting. 

Technology may exert its influence at all phases of the process. For 
example, the transition from constructs to claims can be very time 
consuming, often requiring substantial knowledge engineering. Shute and 
her collaborators (Shute, Torreano & Willis, 2000; Shute & Torreano, 2001) 
have developed software that, in preliminary trials, has markedly reduced the 
time required to elicit and organise expert knowledge for both assessment 
and instructional purposes. Future versions should yield improved coverage 
and fidelity of the claims as well, resulting in both increased efficiency and 
enhanced validity. 

Shifting attention to the implementation phases, technology makes 
possible tools that result in the automation, standardisation and enhancement 
of different assessment processes, rendering them more efficient. Increased 
efficiency, in turn, yields a greater number of feasible designs, some of 
which may yield greater validity. Item development and the scoring of 
constructed responses illustrate the point. 

One of the more time consuming tasks in the assessment process is the 
development of test items. Even in testing organisations with large cadres of 
experienced developers, creating vast pools of items meeting rigorous 
specifications is an expensive undertaking. Over the last five years, tools for 
assisting developers in item generation have come into use, improving 
efficiency by factors of 10 to 20. In the near future, these tools will be 
enhanced so that developers can build items with specified psychometric 
characteristics. This will result in further gains in efficiency as well as some 
improvement in the quality of the item pools. Eventually, item libraries will 
consist of shells or templates with items generated on demand by selecting a 
specific template along with the appropriate parameters (Bejar, 2002). 
Beyond further improvements in efficiency, item generation on demand has 
implications both for the security of high stakes tests and the feasibility of 
offering customised practice tests in instructional settings. 
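
The idea of an item library made up of templates can be sketched in a few 
lines. What follows is a minimal illustration under stated assumptions, not 
a description of any operational item generation tool: the template text, 
the parameter ranges and the names `Item` and `generate_item` are all 
hypothetical. 

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    stem: str
    key: int  # correct answer


def generate_item(template, a_range, b_range, rng):
    """Instantiate one arithmetic item from a template string with
    placeholders {a} and {b}; the key is computed, not stored."""
    a = rng.randint(*a_range)
    b = rng.randint(*b_range)
    return Item(stem=template.format(a=a, b=b), key=a * b)


# Hypothetical template; a real library would hold many such shells.
TEMPLATE = "A box holds {a} pencils. How many pencils are in {b} boxes?"

rng = random.Random(2003)  # fixed seed so a generated form can be reproduced
items = [generate_item(TEMPLATE, (2, 12), (2, 9), rng) for _ in range(3)]
for item in items:
    print(item.stem, "->", item.key)
```

Because only the template and its parameter ranges need to be stored, a 
different seed yields a different but parallel set of items, which is what 
makes on-demand generation relevant to test security and to customised 
practice tests. Controlling the psychometric characteristics of the 
generated items would require calibration work that this sketch does not 
attempt. 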

Test questions that require the student to produce a response are often 
regarded as more desirable than multiple choice questions (Mitchell, 1992). 
Grading the responses, however, imposes administrative and financial 
burdens that are often prohibitive. Arguably, the dominance of the multiple- 
choice format is due in large part to its cost advantage as well as its 
contribution to score reliability. In the early 1990s, the advent of computer- 
delivered tests heralded the possibility of reducing reliance on multiple 
choice items. It became apparent that it would be necessary to develop 
systems to score students’ constructed responses automatically. In fact, expert 
systems to analyse and evaluate even complex products such as architectural 
drawings and prose essays as well as mathematical expressions have been 
put into operation (Bejar & Braun, 1999; Burstein, Wolff, & Lu, 1999). 

Automated scoring yields substantial improvements in cost and time over 
human scoring, usually with no diminution in accuracy. In the course of 
developing such systems, a more rigorous approach to both question 
development and response scoring proves essential and yields, as an 
ancillary benefit, modest improvements in test validity. This argument was 
developed by Bejar and Braun (1994) in the context of architectural 
licensure but holds more generally. 

Computer delivery is perhaps the most obvious application of technology 
to assessment, making possible a host of innovations including the 
presentation of multimedia stimulus materials and the recording of responses 
in different sensory modalities. (See Bennett (1998) for further discussion.) 
In conjunction with automated scoring capabilities, this makes practical a 
broad range of performance assessments. 

With sufficient computing power and fast networks, various levels of 
interactivity are possible. These range from adaptive testing algorithms 
(Swanson and Stocking, 1993) to implementations of Bayes inference 
networks that can dynamically update cognitively grounded student profiles 
and suggest appropriate tasks or activities for follow up (Mislevy, Almond & 
Steinberg, 1999). The paradigmatic applications are the complex interactive 
simulations or intelligent tutoring systems that have been developed by such 
companies as Maxis and Cognitive Arts for the entertainment or 
education/training markets. 
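
To indicate what dynamically updating a student profile involves, the 
following sketch applies Bayes’ rule to a single mastery variable. It is a 
one-node simplification offered for illustration only, not the inference 
networks of Mislevy, Almond and Steinberg; the slip and guess 
probabilities, the thresholds and the function names are assumptions. 

```python
def update_mastery(prior, correct, slip=0.1, guess=0.2):
    """Posterior probability of mastery after one scored response.

    slip  = P(incorrect | mastered)
    guess = P(correct | not mastered)
    Both default values are illustrative assumptions.
    """
    if correct:
        like_m, like_n = 1.0 - slip, guess
    else:
        like_m, like_n = slip, 1.0 - guess
    numerator = like_m * prior
    return numerator / (numerator + like_n * (1.0 - prior))


def suggest_task(p_mastery):
    """Hypothetical follow-up rule keyed to the current estimate."""
    if p_mastery < 0.4:
        return "review"
    if p_mastery < 0.8:
        return "practice"
    return "extension"


p = 0.5  # neutral starting belief
for response in [True, True, False, True]:
    p = update_mastery(p, response)
    print(f"correct={response!s:5}  P(mastery)={p:.3f}  next={suggest_task(p)}")
```

A full inference network would link many such variables, so that evidence 
about one skill also revises beliefs about related skills. 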

The power of technology is magnified when individual tools are 
organised into coherent combinations that can accomplish multiple tasks. 
Gains in efficiency can then be realised both at the task and the system 
levels. An early instance is the infrastructure that was built to develop and 
deliver computer-based architectural simulations that are part of a battery of 
tests used for the registration (licensure) of architects in the U.S. and Canada 
(Bejar & Braun, 1999). The battery includes fifteen different types of 
simulations, each one intended to elicit from candidates a variety of complex 
graphical responses. Furthermore, to meet the client’s targets for cost 
savings, these responses have to be scored automatically without human 
intervention. 

The core of the infrastructure comprises a set of interrelated tools that 
supports authoring of multiple instances of each simulation type, efficient 
rendering of the geometric objects for each such instance, delivery of the 
simulations, capture of the candidates’ data as well as the analysis and 
evaluation of the responses. In comparison to the previous paper-and-pencil 
system, the current system achieves greater fidelity with respect to the 
constructs of interest and more uniformity in scoring accuracy, while 
possessing superior psychometric properties. (The latter include better 
comparability of test forms over time and higher reliability of classification 
decisions.) It also offers candidates significantly more opportunities to sit for 
the test and much more rapid reporting of results. This situation stands in 
stark contrast with most high stakes paper-and-pencil assessments 
(particularly those incorporating some constructed response questions), 
which are offered only once or twice a year with results reported months 
after the administration. 

A later and more sophisticated example is the infrastructure developed by 
Mislevy and his associates (Mislevy, Steinberg, Breyer, Almond & Johnson, 
1999) to facilitate the implementation of a general assessment design 
process, evidence-centred design. The infrastructure comprises both design 
objects and delivery process components. Design objects are employed to 
build student models and task models. In addition, there are systems to 
extract evidence from student work and to dynamically update the student 
model on the basis of that evidence. Finally, there are software components 
that support task selection, task presentation, evidence identification and 
evidence accumulation. The idea is to build flexible modular systems with 
reusable components that can be configured in different ways to efficiently 
support a wide variety of assessment designs. 
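
The modular idea can be illustrated with a small delivery loop in which 
each of the four processes named above is a replaceable function. This is a 
sketch under simplifying assumptions, not the infrastructure described by 
Mislevy and his associates: the student model is reduced to a running 
tally, and every name in the code is hypothetical. 

```python
from typing import Callable, Dict, List, Optional

Task = Dict[str, str]          # e.g. {"id": "t1", "prompt": "...", "key": "4"}
StudentModel = Dict[str, int]  # running tallies; a stand-in for a real model


def run_assessment(
    tasks: List[Task],
    select: Callable[[List[Task], StudentModel], Optional[Task]],
    present: Callable[[Task], str],
    identify: Callable[[Task, str], bool],
    accumulate: Callable[[StudentModel, bool], StudentModel],
) -> StudentModel:
    """Cycle through the four processes until no task is selected."""
    model: StudentModel = {"attempted": 0, "correct": 0}
    remaining = list(tasks)
    while (task := select(remaining, model)) is not None:  # task selection
        remaining.remove(task)
        work = present(task)                    # task presentation
        observation = identify(task, work)      # evidence identification
        model = accumulate(model, observation)  # evidence accumulation
    return model


# Deliberately simple components; any one can be replaced independently.
def select_in_order(remaining, model):
    return remaining[0] if remaining else None


def make_scripted_presenter(answers):
    # Stands in for a delivery system by replaying canned responses.
    return lambda task: answers[task["id"]]


def exact_match(task, work):
    return work.strip() == task["key"]


def tally(model, observation):
    return {"attempted": model["attempted"] + 1,
            "correct": model["correct"] + int(observation)}


tasks = [{"id": "t1", "prompt": "2 + 2 = ?", "key": "4"},
         {"id": "t2", "prompt": "3 x 5 = ?", "key": "15"}]
presenter = make_scripted_presenter({"t1": "4", "t2": "16"})
result = run_assessment(tasks, select_in_order, presenter, exact_match, tally)
print(result)  # {'attempted': 2, 'correct': 1}
```

Swapping `select_in_order` for an adaptive rule, or `tally` for a 
probabilistic update, changes the assessment design without touching the 
loop itself, which is the sense in which the components are reusable. 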



4. ASSESSMENT QUALITY 

How does a technology-driven infrastructure contribute to the quality of 
an assessment system? Recall that we asserted that quality comprises two 
aspects, validity and efficiency. In some settings, we can improve efficiency 
but with little effect on validity. This is illustrated by the use of item 
generation tools to reduce the cost of developing multiple choice items for 
tests of fixed design. In other settings, we can enhance validity but with little 
effect on efficiency. An obvious case in point is lengthening a test by the 
addition of items in order to gain better coverage of the target constructs. 
Validity is improved but at the price of some additional cost and testing 
time. 

Improvements to efficiency or validity (but not both) are typical of the 
choices that test designers must make. That is, there is a trade-off between 
validity and efficiency. Consider that for reasons of coverage and fairness, 
test designs employing different item types are generally preferred to those 
that only use multiple choice items. However, incorporating performance 
items generally raises costs by an order of magnitude and therefore tends to 
preclude their inclusion in the final design. Such considerations are often 
paramount in public school settings, where the cost of hand scoring many 
thousands of essays or mathematical problems, as well as the associated time 
delays, are quite burdensome. Indeed, under the pressure of increased testing 
for accountability, the states of Florida and Maryland recently announced 
that they were reducing or eliminating the use of performance assessments in 
their end-of-year testing programs. 

Technology can make a critical contribution by facilitating the attainment 
of a more satisfactory balance point between validity and efficiency. This 
parallels the notion that technology can render obsolete the traditional trade- 
off between “richness and reach”. Popular in some of the recent business 
literature (Evans & Wurster, 2000), the argument is that in the past one had 
to choose between offering a rich experience to a select few and delivering a 
comparatively thin experience to a larger group. Imagine, for example, the 
difference between watching a play in the theatre and reading it in a book. 
Through video technology a much larger audience can watch the play, 
enjoying a much richer experience than that of the reader — though perhaps 
not quite as rich as that of the playgoer. The promise of the new, technology- 
mediated trade-off between richness and reach was responsible (at least in 
part) for the explosion of investment in education-related internet 
companies. The subsequent collapse of the “dot.com bubble” does not render 
invalid the seminal insight but, rather, is more a reflection of the difficulty in 
creating new markets and the unavoidable consequence of “irrational 
exuberance”. 

In the realm of assessment, the contribution of technology stems from its 
potential to substantially enhance the efficiency of certain key processes. 
This, in turn, can dramatically increase the set of feasible designs, resulting 
in a final design of greater validity. Two striking examples are the automated 
scoring of (complex) student responses and the implementation of adaptive 
assessment. Consider the following examples. 

A number of essay grading systems are now available, including e-rater 
(Burstein et al., 1999) and KAT (Wolfe, Schreiner, Rehder, Laham, Foltz, & 
Landauer, 1998). Although based on different methodologies, they are all 
able to carry out the analysis and evaluation of student essays of several 
hundred words. In the case of e-rater, the graded essays are responses to one 
of a set of pre-determined prompts on which the system has been trained. In 
its current version, e-rater not only provides scale scores and general 
feedback but also detailed diagnostics linked to the individual student’s 
writing. 

A computer-based adaptive mathematics assessment, ALEKS, developed 
by Falmagne and associates (Falmagne, Doignon, Koppen, Villano, & 
Johannesen, 1990; see also http://www.aleks.com), is able in a relatively 
short time to place a student along a continuum of development in the 
pre-collegiate mathematics curriculum. Based on many years of research in 
cognitive psychology and mathematics education, ALEKS enables a teacher 
or counsellor to direct the student to an appropriate class and supports the 
setting of initial instructional goals. 

Intelligent tutoring systems (Snow & Mandinach, 1999; Wenger, 1987) 
are intended to provide adaptive support to learners. They have been built for 
a variety of content areas such as geometry (Anderson, Boyle, & Yost, 
1985), computer programming (Anderson & Reiser, 1984) as well as 
electronic troubleshooting (Lesgold, Lajoie, Bunzo, & Egan, 1992). A 
related example at ETS is Hydrive (Mislevy & Gitomer, 1996), which was 
built to support the training of flight-line mechanics on the hydraulic systems 
of the F-15. It enables the trainees to develop and hone their skills on a large 
library of troubleshooting problems. They are given wide scope in how to 
approach each problem as well as a choice in the level and type of feedback. 
Although they are not aware of it, the feedback is based on a comprehensive 
cognitive analysis of the domain and a sophisticated psychometric model 
(matched to the cognitive analysis), which is dynamically updated with each 
action the student takes. 

Intelligent tutoring systems in the workplace can be very efficient in 
terms of the utilisation of expensive equipment. Apprentices can be required 
to meet certain performance standards before being allowed to work on 
actual equipment, such as jet aircraft. When they do, they are much more 
likely to profit from the experience. Suppose, in addition, that the problem 
library is constantly updated to reflect the problems faced by experts as new 
or modified equipment comes on line. Students then have the benefit of 
being trained on exactly the sort of work they will be expected to do when 
they graduate, largely eliminating the notorious transfer of training problem. 
Such a seamless transition from training to work is exceptionally rare, at 
least in the U.S. 

What these examples have in common is that they illustrate how 
technology can provide large populations of learners with access to 
assessment services that, until recently, were only available to the few 
students blessed with exceptional teachers. In fact, for some purposes, these 
systems (or ones available shortly) are not surpassed by even the best 
teachers. Consider that, with e-rater, students can write essentially as many 
essays as they like, revise them as often as they want, whenever they wish — 
and each time receive immediate, detailed feedback. Moreover, it is possible 
to receive scale scores based on different sets of standards, provided that the 
system has been trained to those standards. High school students could then 
be graded according to the standards set by the English teachers in their 
school, all English teachers in their system or even those set by the English 
teachers at the state college that many hope to attend. The juxtaposition of 
these scores can be instructive for the student and, in the aggregate, serve as 
an impetus for professional development for the secondary school teachers. 

If these developments continue to ramify (depending, in part, on 
sufficient investments), technology will emerge as a force for the 
democratisation of assessment. It will facilitate the use of more valid 
assessments in a broader range of settings than heretofore possible. This 
increase in validity will usually be obtained with no loss of efficiency and, 
perhaps, even with gains in efficiency if a sufficiently long time horizon is 
used. 



5. INDIRECT EFFECTS OF TECHNOLOGY 

Our analysis begins by returning to the framework of Figure 4. While the 
discussion in Section 3 focused on the infrastructure aspect of the Assets 
dimension, consideration of the indirect effects of technology begins with 
the other two aspects of the Assets dimension. 

It is evident that technology - in the form of tool sets and computing 
power - has an enormous effect on the development of many disciplines, 
especially the sciences and engineering. To cite but two recent examples: (i) 
The development of machines capable of rapid gene sequencing was critical 
to the successful completion of the Human Genome Project; (ii) The Hubble 
Space Telescope has revealed unprecedented details of familiar celestial 
objects as well as entirely new ones that have led to advances in cosmology. 
Experimental breakthroughs lead, over time, to reconceptualisations of a 
discipline and, ultimately, to new targets for assessment. 

Similarly, developments in cognitive science, particularly in 
understanding how people learn, gradually influence the constructs and 
models that impact the design of assessments for learning (Pellegrino, 
Chudowsky, & Glaser, 2001). Again, these advances depend in part on 
imaging technology, though this is not the major impetus for the evolution of 
the field. However, these developments do influence measurement models 
that are proposed to capture the salient features of these learning theories. Of 
course, technology plays a critical role in making these more complex 
measurement models practical — but this brings us back to technology’s 
direct effects! In short, through its impact on the different disciplines, 
technology influences the constructs, claims and models that, in turn, shape 
the practice of assessment. 

Returning now to the framework of Figure 2, we examine the indirect 
effects of technology through the prism of Context. There is a complex 
interplay between technology on the one hand and political-economic forces 
on the other. One view is that new technologies can stimulate political and 
economic actions that, in turn, influence the educational environment. In 
most countries, the prospect of increasing economic competitiveness has 
spurred considerable governmental investment in computers and 
communication technology for schools. In the U.S., for example, over the 
last decade states have made considerable progress toward the goal of 
substantially reducing the pupil-computer ratio. In addition, the U.S. 
government through its E-rate program has subsidised the establishment of 
internet connections for public schools. Indeed, both at the state and federal 
levels, there is the hope that the aggressive pursuit of a 
technology-in-education policy can begin to equalise resources and, 
eventually, achievement across schools. The hope, again, is that technology 
can act as a democratising force in education. 

We have already made the point that the promise of technology to effect 
a new trade-off between richness and reach fuelled interest in internet-based 
education companies. Hundreds of such companies were founded and 
attracted, in the aggregate, billions of dollars in capital. While many offered 
innovative products and services, few were successful in establishing a 
viable business. Now that most of those companies have failed or faltered 
and been bought out, we are left with just a handful of behemoths astride the 
landscape. 

One consequence is that in the U.S. four multinational publishers 
dominate the education market with ambitious plans to provide integrated 
suites of products and services to their customers. If they are successful, they 
will gain unprecedented influence over the conduct of pre-college education. 
In particular, assessment offerings are much more likely to be determined by 
financial calculations based on the entire suite to be offered. For example, a 
firm with a large library of multiple choice items on hand will naturally want 
to leverage that asset by offering those same items in an electronic format. 
Moreover, it will be able to offer the prospective buyer much more 
favourable terms than would be the case if it had to build a new pool of 
items incorporating performance assessment tasks and the accompanying 
scoring systems. In the U.S., at least, these conservative tendencies will be 
reinforced by the pressure to deliver practice tests for the high stakes 
outcome assessments that are now mandated by law. 

Thus, technology interacts with the political, economic and market forces 
that help to shape the environments in which assessment takes place. In the 
case of instructional assessment, the presence of an internet connection in a 
classroom expands the range of such assessments that can be delivered to the 
students. For the moment, the impact on the individual student is limited 
both by the amount of time they can actually access the internet and the 
quality of the materials available. The first issue, which we can term 
“effective connectivity”, will diminish with the continuing expansion of 
traditional computer resources and the increasing penetration of wireless 
technology. The latter issue is more problematic. In U.S. public education 
systems there is considerable pressure to focus such assessment activities on 
preparing students to take end-of-year tests tied to state-level accountability. 
Consequently, they tend to rely on standard multiple-choice items that target 
basic competencies. 

In the case of assessment for other purposes, the situation is somewhat 
different. The existence of a network of testing centres with secure electronic 
transmission capabilities makes possible the delivery of high stakes 
computer-based tests for both selection and qualification. Examples of the 
former (for college and beyond) are the GRE, GMAT and TOEFL. 
Examples (in the U.S.) of the latter are tests for professional licensure in 
medicine and architecture. Internationally, certifications in various IT 
specialities are routinely delivered on computer. It should be noted, however, 
that with the notable exceptions of medicine and architecture, the formats 
and content of these assessments have not changed much with computer 
delivery. High stakes assessments are subject to many (conservative) forces 
that appear to make real change difficult to accomplish. 

Technology can shape the design space by contributing to the creation of 
novel learning environments (e.g. e-learning), where the constraints and 
affordances are quite different from those in traditional settings. Such 
environments pose entirely different challenges and opportunities. In 
principle, assessments integrated with instruction can play a more salient 
role given concerns about attrition rates and academic achievement in online 
courses. Presumably, students in such courses have uninterrupted internet 
access so that they can access on-demand assessment at any time. 
Assessments that provide rich feedback and support for instruction would be 
especially valuable in such settings. There is little evidence, however, that 
assessment providers have risen to the occasion and developed portfolios of 
diverse assessment instruments that begin to meet these needs. Many e- 
learning environments offer chat rooms, threaded discussion capabilities and 
the like. Consequently, students can participate in multiple conversations, 
review archival transcripts and communicate asynchronously with teachers. 
These settings offer opportunities for collecting and analysing data on the 
nature and patterns of interactions among students and teachers. Again, there 
is little evidence that such assessments are being undertaken. 

It is also necessary to expand the notion of the virtual environment to 
include the cultural milieu of today’s students as well as the work world of 
their parents and grandparents. Technology plays an important role in 
providing a seemingly endless stream of new tools (toys) to which 
individuals become accustomed in their daily lives at home and in the 
workplace. These range from cell phones and palm pilots to computers (and 
the sophisticated software applications that have been developed for them). 
Bennett (2002) discusses at length how the continuing infusion of new 
technology into the workplace, and the increasing importance of 
technology-related skills, place enormous pressure on schools both to 
incorporate technology into the curriculum and to enhance its role as a tool 
for instruction and assessment. 

Another, equally fundamental, question is how children who grow up in 
an environment filled with continuous streams of audio-visual stimulation, 
and who may spend considerable time in a variety of virtual worlds, 
understand and represent reality, and develop patterns of cognition and 
learning preferences. Some of these issues have been investigated by Turkle (1984, 
1995). Increasingly, this will pose a challenge to both instruction and 
assessment. Teachers and educational software developers will have to take 
note of these (r)evolutionary trends if they are to be effective and credible in 
the classroom. 



6. TECHNOLOGY AND PURPOSE 

Although purpose is not directly impacted by technology, context does 
influence the relative importance of the different kinds of purpose - choose, 
learn or qualify. While all three are essential and will continue to evolve 
under the influence of technology and other forces, political and economic 
considerations will usually determine which one is paramount at a particular 
time and place. For a particular purpose, these same considerations will 
govern the direction and pace of technological change. As we have already 
argued, advances in the technology (and the science) of assessment make 
possible new trade-offs between validity and efficiency with respect to that 
purpose. However, the decisions actually taken may well leave the balance 
point unchanged, favour improvements in efficiency over those in validity, 
or vice versa. 

This assertion is illustrated by the current focus in the U.S. on 
accountability testing, which has already been alluded to. Concern with the 
productivity of public education has led to a rare bi-partisan agreement on 
the desirability of rigorous, standards-based testing in several subjects at the 
end of each school year. As a result, hundreds of millions of dollars will be 
allocated to the states for the development and administration of these 
examinations and for the National Assessment of Educational Progress that 
is to be used (in part) to monitor the states’ accountability efforts. Moreover, 
it appears that, in the short run at least, concerns with efficiency will take 
precedence over improvements in validity. 

Recent surveys suggest that the U.S. public is generally in agreement 
with their elected officials. On the other hand, many educators and 
educational researchers have expressed grave concern about the trend to 
increased testing. They assert that increased reliance on high stakes external 
assessments will only stifle innovation and lead to lowered productivity. In 
other words, they are questioning the systemic validity (Frederiksen & 
Collins, 1989) of such tests. Moreover, they argue that the effectiveness of 
high stakes tests as a tool for reform is greatly exaggerated and that these 
tests are favoured by those in government and business only because the 
required investments are relatively small in comparison to those required for 
improvements to physical plant or for meaningful changes in curriculum and 
instruction. See, for example, Amrein and Berliner (2002), Kohn (2000) and 
Mehrens (1998). Among those who support a greater role for technology in 
education, there is concern that the current environment presents many 
impediments to rapid progress, at least in the near term (Solomon & Schrum, 
2002). In addition to the emphasis on high stakes testing, they cite economic, 
philosophical and empirical factors. 

At bottom, there are strongly opposing views about the kinds of tests and 
types of technology that are most needed and, therefore, what is the most 
effective way to allocate scarce resources. How this conflict is resolved, and 
the choices that are made, will indeed determine which assessment 
technologies are developed, implemented and brought to scale. 



7. CONCLUSIONS 

The framework and corresponding analysis that have been presented 
strongly support the contention that technology has enormous potential to 
influence assessment. In brief, we have argued that the evolution of 
assessment practice results from the dynamic interplay between the demands 
on assessment imposed by a world that is shaped to some degree by 
technology and the range of possible futures for assessment made possible 
by the tools and systems that are the direct products of technology. Salomon 
and Almog (2000) have made an analogous argument with respect to the 
relationship between technology and educational psychology. 

It is important to consider the negative impact technology can have on 
education generally and on assessment practice specifically. Policy makers see 
technology as a “quick fix” to the ills of education. Resource allocation is 
then skewed toward capital expenditures but, typically, without sufficient 
concomitant investments in technical support, teacher training and 
curriculum development. One consequence is inefficient usage (high 
opportunity costs), accompanied by demoralisation of the teaching force. 
Similarly, new technologies, such as item generation, can make multiple 
choice items - in comparison to performance assessments - very attractive 
to test publishers and state education departments, with the result that 
assessment continues to be seen as the primary engine of change. 

Too often, discussions of the promise of technology dwell on how it 
makes possible “new modes of assessment”. For the most part, that is a 
misnomer. More typically, the role of technology is to facilitate the broader 
dissemination of assessment practices that were heretofore reserved for the 
fortunate few. Intelligent tutoring systems, for example, are often regarded 
as incorporating new modes of instruction and assessment. However, the 
notion that a learner ought to have the benefit of a teacher who has mastered 
all relevant knowledge, applies state-of-the-art pedagogical principles and 
adapts appropriately to the student’s learning style, is not a new one. In point 
of fact, it is probably not very far from what Philip of Macedon had in mind 
when he hired Aristotle to tutor his son Alexander. Perhaps we should 
reserve the term “new modes” for such innovations as virtual reality 
simulations or assessments of an individual’s ability to carry out information 
search on the web. Some other examples are found in Baker (2000). 

It may be more productive to speculate on how the needs and interests of 
a technology-driven world may lead to new views of the nature and function 
of assessment. At the same time, further developments in cognitive science, 
educational psychology and psychometrics, along with new tool systems, 
undoubtedly will contribute to the evolution of assessment practice. 

While educators and educational researchers are not ordinarily in a 
position to influence the macro trends that shape the environment, they are 
not without relevant skills and resources. They ought to cultivate a broader 
perspective on assessment and a deeper understanding of the interplay 
between context, purpose and assets. That understanding can help them to 
better predict trends in education generally and anticipate future demands on 
assessment, in particular. It may then be possible to formulate responses that 
substantially improve both validity and efficiency. Such responses may 
involve novel applications of existing technology or even the development 
of entirely new methods and the corresponding technologies for their 
implementation. 

Those who are concerned about how assessment develops under the 
influence of technology and other forces must keep their eye on the potential 
of technology to democratise the practice of assessment. With validity and 
equity as lodestars, it is more likely that the power of technology can be 
harnessed in the service of humane educational ends. 